
Relational Graph Attention Networks

04/11/2019
by   Dan Busbridge, et al.

We investigate Relational Graph Attention Networks, a class of models that extends non-relational graph attention mechanisms to incorporate relational information, opening up these methods to a wider variety of problems. A thorough evaluation of these models is performed, and comparisons are made against established benchmarks. To provide a meaningful comparison, we retrain Relational Graph Convolutional Networks, the spectral counterpart of Relational Graph Attention Networks, and evaluate them under the same conditions. We find that Relational Graph Attention Networks perform worse than anticipated, although some configurations are marginally beneficial for modelling molecular properties. We provide insights as to why this may be, and suggest both modifications to evaluation strategies and directions to investigate in future work.




1 Introduction

CNNs successfully solve a variety of tasks in Euclidean grid-like domains, such as image captioning (Donahue et al., 2017) and video classification (Karpathy et al., 2014). CNNs are successful because they assume the data is locally stationary and compositional (Defferrard et al., 2016; Henaff et al., 2015; Bruna et al., 2013).

However, data often occurs in the form of graphs or manifolds, which are classic examples of non-Euclidean domains. Specific instances include knowledge bases, molecules, and point clouds captured by 3D data acquisition devices (Wang et al., 2018). The generalisation of neural networks to non-Euclidean domains is termed geometric deep learning (GDL), and may be roughly divided into spectral, spatial and hybrid approaches (Bronstein et al., 2017).

Spectral approaches (Defferrard et al., 2016), most notably GCNs (Kipf and Welling, 2016), are limited by their basis-dependence: a filter learned with respect to a basis on one domain is not guaranteed to behave similarly when applied to another basis and domain. Spatial approaches are limited by an absence of shift invariance and the lack of a coordinate system (Duvenaud et al., 2015; Atwood and Towsley, 2016; Monti et al., 2017). Hybrid approaches combine spectral and spatial approaches, trading their advantages and deficiencies against each other (Bronstein et al., 2017; Rustamov and Guibas, 2013; Szlam et al., 2005; Gavish et al., 2010).

A recent line of work, beginning with GAT, applies attention mechanisms to graphs and does not share these limitations (Veličković et al., 2017; Gong and Cheng, 2018; Zhang et al., 2018; Monti et al., 2018; Lee et al., 2018).

An alternative direction has been to generalise RNNs from sequential message passing on one-dimensional signals to message passing on graphs (Sperduti and Starita, 1997; Frasconi et al., 1997; Gori et al., 2005). Incorporating gating mechanisms led to the development of GGNNs (Scarselli et al., 2009; Allamanis et al., 2017). (We note that GGNNs support relation types; evaluating these models on the tasks presented here would be necessary to acquire a better understanding of neural models of relational data.)

RGCNs have been proposed as an extension of GCNs to the domain of relational graphs (Schlichtkrull et al., 2018). This model has achieved impressive performance on node classification and link prediction tasks; however, its mechanism still resides within spectral methods and shares their deficiencies. The focus of this work is to investigate generalisations of RGCN away from its spectral origins.

We take RGCN as a starting point and investigate a class of models we term Relational Graph Attention Networks (RGAT), extending attention mechanisms to the relational graph domain. We consider two variants, within-relation graph attention (WIRGAT) and across-relation graph attention (ARGAT), each with either additive or multiplicative attention. We perform an extensive hyperparameter search, and evaluate these models on challenging transductive node classification and inductive graph classification tasks. These models are compared against established benchmarks, as well as a re-tuned RGCN model.

We show that RGAT performs worse than expected, although some configurations produce marginal benefits on inductive graph classification tasks. To aid further investigation in this direction, we present the full CDFs for the hyperparameter searches in Appendix D, and statistical hypothesis tests in Appendix E. We also provide a vectorised, sparse, batched implementation of RGAT and RGCN in TensorFlow, compatible with eager execution mode, to open up research into these models to a wider audience (https://github.com/Babylonpartners/rgat).

2 RGAT architecture

2.1 Relational graph attention layer

We follow the construction of the GAT layer of Veličković et al. (2017), extending it to the relational setting using ideas from Schlichtkrull et al. (2018).

Layer input and output

The input to the layer is a graph with $R$ relation types and $N$ nodes. The $i$-th node is represented by a feature vector $x_i \in \mathbb{R}^F$, and the features of all nodes are summarised in the feature matrix $X \in \mathbb{R}^{N \times F}$. The output of the layer is the transformed feature matrix $X' \in \mathbb{R}^{N \times F'}$, where $x'_i \in \mathbb{R}^{F'}$ is the transformed feature vector of the $i$-th node.

Intermediate representations

Different relations convey distinct pieces of information. The update rule of Schlichtkrull et al. (2018) makes this manifest by assigning each node $i$ a distinct intermediate representation $g_i^{(r)}$ under each relation $r$:

$G^{(r)} = X \, W^{(r)},$   (1)

where $G^{(r)} \in \mathbb{R}^{N \times F'}$ is the intermediate representation feature matrix under relation $r$, and $W^{(r)} \in \mathbb{R}^{F \times F'}$ are the learnable parameters of a shared linear transformation.
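As an illustration of the shapes involved, the following NumPy sketch computes the per-relation intermediate representations of Equation 1 (a dense toy example with made-up dimensions, not the authors' TensorFlow implementation):

```python
import numpy as np

N, F, F_prime, R = 5, 8, 4, 3          # nodes, input dim, output dim, relations (illustrative)
rng = np.random.default_rng(0)

X = rng.normal(size=(N, F))            # node feature matrix X
W = rng.normal(size=(R, F, F_prime))   # one kernel W^(r) per relation

# Equation 1: G^(r) = X W^(r), stacked over relations -> shape (R, N, F')
G = np.einsum('nf,rfk->rnk', X, W)
print(G.shape)                         # (3, 5, 4)
```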

Logits

Following Veličković et al. (2017); Zhang et al. (2018), we assume the attention coefficient between two nodes is based only on the features of those nodes, up to a neighborhood-level normalisation. To keep computational complexity linear in $R$, we assume that, given the linear transformations $W^{(r)}$, the logits $E^{(r)}_{i,j}$ of each relation are independent:

$E^{(r)}_{i,j} = a\left(g^{(r)}_i, g^{(r)}_j\right),$   (2)

and indicate the importance of node $j$'s intermediate representation to that of node $i$ under relation $r$. The attention is masked so that, for node $i$, coefficients $\alpha^{(r)}_{i,j}$ exist only for $j \in \mathcal{N}^{(r)}_i$, where $\mathcal{N}^{(r)}_i$ denotes the set of neighbor indices of node $i$ under relation $r$.

Figure 1: WIRGAT. The intermediate representations $g_i^{(r)}$ for node $i$ (left red rectangle) are combined with the intermediate representations $g_j^{(r)}$ for nodes $j$ in its neighborhood (blue rectangles) under each relation $r$ to form each logit $E_{i,j}^{(r)}$. A softmax is taken over each logit matrix for each relation type to form the attention coefficients $\alpha_{i,j}^{(r)}$. These attention coefficients construct a weighted sum over the nodes in the neighborhood for each relation (black rectangle). These are then aggregated and passed through a nonlinearity to produce the updated representation $x'_i$ for node $i$ (right red rectangle).
Queries, keys and values

The logits are composed from queries and keys, and specify how the values, i.e. the intermediate representations $g_j^{(r)}$, will combine to produce the updated node representations (Vaswani et al., 2017). A separate query kernel $Q^{(r)} \in \mathbb{R}^{F' \times D}$ and key kernel $K^{(r)} \in \mathbb{R}^{F' \times D}$ project the intermediate representations $g_i^{(r)}$ into query and key representations of dimensionality $D$:

$q^{(r)}_i = g^{(r)}_i \, Q^{(r)}, \qquad k^{(r)}_i = g^{(r)}_i \, K^{(r)}.$   (3)

For convenience, the query and key kernels are combined to form the attention kernels $A^{(r)}$. These query and key representations are the building blocks of the two specific realisations of $a$ in Equation 2 that we now consider.

Additive attention logits

The first realisation of $a$ we consider is the relational modification of the logit mechanism of Veličković et al. (2017):

$E^{(r)}_{i,j} = \mathrm{LeakyReLU}\left(q^{(r)}_i + k^{(r)}_j\right),$   (4)

where the query and key dimensionality are both $D = 1$, and $q^{(r)}_i$ and $k^{(r)}_j$ are scalar flattenings of their one-dimensional vector counterparts. We refer to any instance of RGAT using logits of the form in Equation 4 as additive RGAT.

Multiplicative attention logits

The second realisation we consider is the multiplicative mechanism of Vaswani et al. (2017); Zhang et al. (2018). (The form of our mechanism is not precisely that of Zhang et al. (2018), as they also consider residual concatenation and a gating mechanism applied across the heads of the attention mechanism.)

$E^{(r)}_{i,j} = q^{(r)}_i \cdot k^{(r)}_j,$   (5)

where the query and key dimensionality $D$ can be any positive integer. We refer to any instance of RGAT using logits of the form in Equation 5 as multiplicative RGAT.
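The difference between the two logit constructions can be sketched in NumPy as follows (a dense toy example under the assumptions stated above for Equations 4 and 5; dimensions and the LeakyReLU slope are illustrative, and this is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, F_prime, D = 5, 4, 2
G_r = rng.normal(size=(N, F_prime))       # intermediate representations under one relation r

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

# Additive logits (Equation 4): D = 1, so queries and keys are scalars per node.
Q_add = rng.normal(size=(F_prime, 1))
K_add = rng.normal(size=(F_prime, 1))
q_add = (G_r @ Q_add).ravel()             # q_i^(r)
k_add = (G_r @ K_add).ravel()             # k_j^(r)
E_add = leaky_relu(q_add[:, None] + k_add[None, :])   # E_ij^(r), shape (N, N)

# Multiplicative logits (Equation 5): D can be any positive integer.
Q_mul = rng.normal(size=(F_prime, D))
K_mul = rng.normal(size=(F_prime, D))
E_mul = (G_r @ Q_mul) @ (G_r @ K_mul).T   # E_ij^(r) = q_i^(r) . k_j^(r)
```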

It should be noted that there are many types of attention mechanisms beyond vanilla additive and multiplicative. These include mechanisms leveraging the structure of the dual graph (Monti et al., 2018) as well as learned edge features (Gong and Cheng, 2018).

The attention coefficients should be comparable across nodes. This can be achieved by applying a softmax appropriately to the logits $E^{(r)}_{i,j}$. We investigate two candidates, each encoding a different prior belief about the importance of different relations.

WIRGAT

The simplest way to take the softmax over the logits of Equation 4 or Equation 5 is to do so independently for each relation $r$:

$\alpha^{(r)}_{i,j} = \operatorname{softmax}_j\left(E^{(r)}_{i,j}\right) = \frac{\exp\left(E^{(r)}_{i,j}\right)}{\sum_{k \in \mathcal{N}^{(r)}_i} \exp\left(E^{(r)}_{i,k}\right)}, \qquad \sum_{j \in \mathcal{N}^{(r)}_i} \alpha^{(r)}_{i,j} = 1 \;\; \forall\, i, r.$   (6)

We call the attention in Equation 6 within-relation graph attention (WIRGAT); it is shown in Figure 1. This mechanism encodes the prior that relation importance is a purely global property of the graph by implementing an independent probability distribution over the nodes in the neighborhood of node $i$ for each relation $r$. Explicitly, for any node $i$ and relation $r$, nodes $j, k \in \mathcal{N}^{(r)}_i$ yield competing attention coefficients $\alpha^{(r)}_{i,j}$ and $\alpha^{(r)}_{i,k}$ with sizes depending on their corresponding representations $g^{(r)}_j$ and $g^{(r)}_k$. There is no competition between any attention coefficients $\alpha^{(r)}_{i,j}$ and $\alpha^{(r')}_{i,k}$ for nodes $j \in \mathcal{N}^{(r)}_i$ and $k \in \mathcal{N}^{(r')}_i$ where $r \neq r'$, irrespective of node representations.
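A minimal dense sketch of the WIRGAT normalisation in Equation 6, assuming logits are first masked outside each relation-specific neighborhood (the masking value and tensor layout are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def wirgat_attention(E, mask):
    """E, mask: arrays of shape (R, N, N). Softmax over neighbours j, independently per relation r."""
    E = np.where(mask, E, -1e9)                        # exclude non-neighbours under each relation
    E = E - E.max(axis=-1, keepdims=True)              # numerical stability
    exp_E = np.where(mask, np.exp(E), 0.0)
    denom = np.clip(exp_E.sum(axis=-1, keepdims=True), 1e-12, None)
    return exp_E / denom                               # each neighbourhood N_i^(r) sums to 1
```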

Figure 2: ARGAT. The logits $E^{(r)}_{i,j}$ are produced identically to those in Figure 1. A softmax is taken across all logits, independent of relation type, to form the attention coefficients $\alpha^{(r)}_{i,j}$. The remaining weighting and aggregation steps are the same as those in Figure 1.
ARGAT

An alternative way to take the softmax over the logits of Equation 4 or Equation 5 is across node neighborhoods, irrespective of relation:

$\alpha^{(r)}_{i,j} = \operatorname{softmax}_{j,r}\left(E^{(r)}_{i,j}\right) = \frac{\exp\left(E^{(r)}_{i,j}\right)}{\sum_{r'=1}^{R} \sum_{k \in \mathcal{N}^{(r')}_i} \exp\left(E^{(r')}_{i,k}\right)}, \qquad \sum_{r=1}^{R} \sum_{j \in \mathcal{N}^{(r)}_i} \alpha^{(r)}_{i,j} = 1 \;\; \forall\, i.$   (7)

We call the attention in Equation 7 across-relation graph attention (ARGAT); it is shown in Figure 2. This mechanism encodes the prior that relation importance is a local property of the graph by implementing a single probability distribution over the different representations $g^{(r)}_j$ for nodes $j$ in the neighborhood of node $i$. Explicitly, for any node $i$ and all relations $r$ and $r'$, nodes $j \in \mathcal{N}^{(r)}_i$ and $k \in \mathcal{N}^{(r')}_i$ yield competing attention coefficients $\alpha^{(r)}_{i,j}$ and $\alpha^{(r')}_{i,k}$ with sizes depending on their corresponding representations $g^{(r)}_j$ and $g^{(r')}_k$.
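A corresponding dense sketch of the ARGAT normalisation in Equation 7, under the same illustrative assumptions about masking and tensor layout:

```python
import numpy as np

def argat_attention(E, mask):
    """E, mask: arrays of shape (R, N, N). For each node i, softmax jointly over relations r and neighbours j."""
    E = np.where(mask, E, -1e9)                        # exclude non-neighbours
    E = E - E.max(axis=(0, 2), keepdims=True)          # stabilise per target node i
    exp_E = np.where(mask, np.exp(E), 0.0)
    denom = np.clip(exp_E.sum(axis=(0, 2), keepdims=True), 1e-12, None)
    return exp_E / denom                               # sums to 1 over all (r, j) for each node i
```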

Comparison to RGCN

For comparison, the coefficients of RGCN are given by $\alpha^{(r)}_{i,j} = 1 / |\mathcal{N}^{(r)}_i|$. This encodes the prior that the intermediate representations of the nodes neighbouring node $i$ under relation $r$ are all equally important.

Propagation rule

Combining the attention mechanism of either Equation 6 or Equation 7 with the neighborhood aggregation step of Schlichtkrull et al. (2018) gives

$x'_i = \sigma\left(\sum_{r=1}^{R} \sum_{j \in \mathcal{N}^{(r)}_i} \alpha^{(r)}_{i,j}\, g^{(r)}_j\right),$   (8)

where $\sigma$ represents an optional nonlinearity. Similar to Vaswani et al. (2017); Veličković et al. (2017), we also find that using multiple heads in the attention mechanism can enhance performance:

$x'_i = \Big\Vert_{k=1}^{K}\, \sigma\left(\sum_{r=1}^{R} \sum_{j \in \mathcal{N}^{(r)}_i} \alpha^{(r,k)}_{i,j}\, g^{(r,k)}_j\right),$   (9)

where $\Vert$ denotes vector concatenation, $\alpha^{(r,k)}_{i,j}$ are the normalised attention coefficients under relation $r$ computed by either WIRGAT or ARGAT for head $k$, and $g^{(r,k)}_j$ is the head-specific intermediate representation of node $j$ under relation $r$.

It might be interesting to consider cases where there are a different number of heads for different relation types, as well as cases where a mixture of ARGAT and WIRGAT produces the attention coefficients; however, we leave this for future investigation and will not consider it further.
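A minimal NumPy sketch of the aggregation in Equations 8 and 9, assuming dense attention tensors of shape (R, N, N) and value tensors of shape (R, N, F') (illustrative only, not the authors' sparse TensorFlow implementation):

```python
import numpy as np

def rgat_propagate(alpha, G, sigma=np.tanh):
    """Equation 8 sketch: alpha is (R, N, N) attention, G is (R, N, F') values. Returns (N, F')."""
    return sigma(np.einsum('rij,rjf->if', alpha, G))   # sum over relations r and neighbours j

def rgat_propagate_multihead(alphas, Gs, sigma=np.tanh):
    """Equation 9 sketch: per-head attention/value pairs; head outputs are concatenated feature-wise."""
    return np.concatenate([rgat_propagate(a, g, sigma) for a, g in zip(alphas, Gs)], axis=-1)
```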

Basis decomposition

The number of parameters in the RGAT layer increases linearly with the number of relations $R$ and heads $K$, and can quickly lead to overparameterisation. In RGCN it was found that decomposing the kernels into a shared basis was beneficial for generalisation, although this comes at the cost of increased model bias (Schlichtkrull et al., 2018). We follow this approach, decomposing both the kernels $W^{(r)}$ and the kernels $A^{(r)}$ of the attention mechanism into $B_W$ basis matrices $V_b$ and $B_A$ basis vectors $v_b$:

$W^{(r)} = \sum_{b=1}^{B_W} c^{(r)}_b V_b, \qquad A^{(r)} = \sum_{b=1}^{B_A} d^{(r)}_b v_b,$   (10)

where $c^{(r)}_b$ and $d^{(r)}_b$ are basis coefficients. We consider models using both full and decomposed $W^{(r)}$ and $A^{(r)}$.
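A small NumPy sketch of the basis decomposition of the kernels $W^{(r)}$ in Equation 10, with made-up dimensions to show the parameter saving (the attention kernels $A^{(r)}$ would be decomposed analogously):

```python
import numpy as np

rng = np.random.default_rng(2)
R, B, F, F_prime = 12, 4, 16, 8          # relations, bases, input and output dims (illustrative)

V = rng.normal(size=(B, F, F_prime))     # shared basis matrices V_b
c = rng.normal(size=(R, B))              # per-relation coefficients c_b^(r)

# Equation 10: W^(r) = sum_b c_b^(r) V_b, for all relations at once -> (R, F, F')
W = np.einsum('rb,bfk->rfk', c, V)

# Parameter count: R*B + B*F*F' instead of R*F*F' for the full kernels.
print(W.shape, c.size + V.size, R * F * F_prime)
```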

2.2 Node classification

[Figure 3 schematics: (a) node classification: Input → RGAT (concat) → RGAT (mean) → node loss; (b) multi-task graph classification: Input → RGAT (concat) → RGAT (concat) → graph gather → Dense → Dense → graph loss.]
Figure 3: (a) The network architecture used for node classification on AIFB and MUTAG. This architecture is the same as in Schlichtkrull et al. (2018), except with RGCN replaced by RGAT. (b) The network architecture used for multi-task graph classification on Tox21. This architecture is the same as the GCN architecture in Altae-Tran et al. (2016), except with RGCN replaced by RGAT, and we do not use graph pooling.

For the transductive task of semi-supervised node classification, we employ the two-layer RGAT architecture shown in Figure 3(a). We use a ReLU activation after the first (concatenating) RGAT layer, and a node-wise softmax on the final layer to produce an estimate of the probability that the $i$-th node belongs to class $c$:

$\hat{y}_{i,c} = \operatorname{softmax}_c\left(x'_{i,c}\right),$   (11)

where $x'_i$ is the output of the final layer for node $i$. We then employ a masked cross-entropy loss to constrain the network updates to the subset $\mathcal{Y}$ of nodes whose labels are known:

$\mathcal{L} = -\sum_{i \in \mathcal{Y}} \sum_{c} y_{i,c} \ln \hat{y}_{i,c},$   (12)

where $y_{i,c}$ is the one-hot representation of the true label for node $i$.
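A minimal sketch of the masked cross-entropy loss in Equation 12, assuming dense prediction and label arrays (illustrative only):

```python
import numpy as np

def masked_cross_entropy(y_hat, y_true, labelled, eps=1e-12):
    """Equation 12 sketch: y_hat, y_true have shape (N, C); labelled is a boolean mask of known labels."""
    per_node = -(y_true * np.log(y_hat + eps)).sum(axis=-1)   # cross-entropy per node
    return per_node[labelled].sum()                           # only labelled nodes contribute
```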

2.3 Graph classification

For inductive graph classification, we employ a two-layer RGAT followed by a graph gather and a dense network, the architecture shown in Figure 3(b). We use ReLU activations after each RGAT layer and after the first dense layer. We use a tanh activation after the graph gather, which is a vector concatenation of the mean of the node representations with the feature-wise max of the node representations:

$\operatorname{GraphGather}(X') = \left[\frac{1}{N}\sum_{i=1}^{N} x'_i \;\Big\Vert\; \max_i x'_i\right].$   (13)

The final dense layer then produces logits of size $T \times C$, where $T$ is the number of tasks and $C$ the number of classes, and we apply a task-wise softmax to its output to produce an estimate $\hat{y}_{t,c}$ of the probability that the graph is in class $c$ for a given task $t$, analogous to Equation 11. A weighted cross-entropy loss is then used to form the learning objective

$\mathcal{L} = -\sum_{t=1}^{T} \sum_{c} w_{t,c}\, y_{t,c} \ln \hat{y}_{t,c},$   (14)

where $w_{t,c}$ and $y_{t,c}$ are the weights and one-hot true labels for task $t$ and class $c$ respectively.
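A minimal sketch of the graph gather of Equation 13 and the weighted cross-entropy of Equation 14, under the stated assumption that the gather concatenates the mean and feature-wise max of the node representations (illustrative, not the authors' implementation):

```python
import numpy as np

def graph_gather(nodes):
    """Equation 13 sketch: nodes has shape (N, F'). Concatenate mean and feature-wise max, then tanh."""
    return np.tanh(np.concatenate([nodes.mean(axis=0), nodes.max(axis=0)]))

def weighted_cross_entropy(y_hat, y_true, w, eps=1e-12):
    """Equation 14 sketch: y_hat, y_true and w are (T, C) task-by-class arrays."""
    return -(w * y_true * np.log(y_hat + eps)).sum()
```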

3 Evaluation

3.1 Datasets

We evaluate the models on transductive and inductive tasks. Following the experimental setup of Schlichtkrull et al. (2018) for the transductive tasks, we evaluate our model on the RDF datasets AIFB and MUTAG. We also evaluate our model on an inductive task using the molecular dataset Tox21. Details of these datasets are given in Table 1. For further details on the transductive and inductive datasets, please see Ristoski and Paulheim (2016) and Wu et al. (2018) respectively.

Datasets AIFB MUTAG Tox21
Task Transductive Transductive Inductive
Nodes 8,285 (1 graph) 23,644 (1 graph) 145,459 (8014 graphs)
Edges 29,043 74,227 151,095
Relations 45 23 4
Labelled 176 340 96,168 (12 per graph)
Classes 4 2 12 (multi-label)
Train nodes 112 218 (6411 graphs)
Validation nodes 28 54 (801 graphs)
Test nodes 28 54 (802 graphs)
Table 1: A summary of the datasets used in our experiments and how they are partitioned.
Transductive baselines

We consider as a baseline the recent state-of-the-art results from Schlichtkrull et al. (2018), obtained with a two-layer RGCN model with 16 hidden units and basis function decomposition. We also include the same challenging baselines of FEAT (Paulheim and Fürnkranz, 2012), WL (Shervashidze et al., 2011; de Vries and de Rooij, 2015) and RDF2Vec (Ristoski and Paulheim, 2016). In-depth details of these baselines are given by Ristoski and Paulheim (2016).

Inductive baselines

As baselines for Tox21, we compare against the most competitive methods reported in Wu et al. (2018). Specifically, we compare against deep multitask networks (Ramsundar et al., 2015), deep bypass multitask networks (Wu et al., 2018), Weave (Kearnes et al., 2016), and an RGCN model whose relational structure is determined by the degree of the node to be updated (Altae-Tran et al., 2016). Specifically, up to and including some maximum degree $d_{\max}$,

$x'_i = \sigma\left(W^{\mathrm{self}}_{\deg(i)}\, x_i + \sum_{j \in \mathcal{N}_i} W_{\deg(i)}\, x_j + b_{\deg(i)}\right),$   (15)

where $W^{\mathrm{self}}_{\deg(i)}$ is a degree-specific linear transformation for self-connections, $W_{\deg(i)}$ is a degree-specific linear transformation mapping neighbours into their intermediate representations, and $b_{\deg(i)}$ is a degree-specific bias. Any update for a node whose degree exceeds $d_{\max}$ is assigned to the update for the maximum degree $d_{\max}$.
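A dense NumPy sketch of the degree-specific update in Equation 15, under illustrative assumptions about the adjacency format and degree clipping (not the DeepChem implementation used by Altae-Tran et al.):

```python
import numpy as np

def degree_graph_conv(X, A, W_self, W_neigh, b, d_max, sigma=np.tanh):
    """X: (N, F); A: (N, N) binary adjacency; W_self, W_neigh: (d_max + 1, F, F'); b: (d_max + 1, F')."""
    deg = np.clip(A.sum(axis=1).astype(int), 0, d_max)   # degrees above d_max share d_max's parameters
    neigh_sum = A @ X                                     # sum of neighbour features for each node
    return np.stack([sigma(X[i] @ W_self[d] + neigh_sum[i] @ W_neigh[d] + b[d])
                     for i, d in enumerate(deg)])
```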

3.2 Experimental setup

Transductive learning

For the transductive learning tasks, the architecture discussed in Section 2.2 was applied. Its hyperparameters were optimised for both AIFB and MUTAG on their respective training/validation sets defined in Ristoski and Paulheim (2016), using 5-fold cross-validation. Using the found hyperparameters, we retrain on the full training set and report results on the test set across 200 seeds. We employ early stopping on the validation set during cross-validation to determine the number of epochs to run on the final training set. Hyperparameter optimisation details are given in Table 4 of Appendix B.

Inductive learning

For the inductive learning tasks, the architecture discussed in Section 2.3 was applied. In order to optimise hyperparameters once, ensure no data leakage, and also provide benchmarks comparable to those presented in Wu et al. (2018), three benchmark splits were taken from the MolNet benchmarks (retrieved from http://deepchem.io.s3-website-us-west-1.amazonaws.com/trained_models/Hyperparameter_MoleculeNetv3.tar.gz), and graphs belonging to any of the test sets were isolated. Using the remaining graphs, we performed a hyperparameter search using 2 folds of 10-fold cross-validation. Using the found hyperparameters, we then retrained on the three benchmark splits provided with 2 seeds each, giving an unbiased estimate of model performance. We employ early stopping during both the cross-validation and the final runs (the validation set of the inductive task is available for the final benchmark, in contrast to the transductive tasks) to determine the number of training epochs. Hyperparameter optimisation details are given in Table 5 of Appendix B.

Constant attention

In all experiments, we train with the attention mechanism turned on. At evaluation time, however, we report results both with and without the attention mechanism to provide insight into whether the attention mechanism helps. ARGAT (WIRGAT) evaluated with constant attention is called C-ARGAT (C-WIRGAT).

3.3 Results

Model AIFB MUTAG
Feat
WL
RDF2Vec
RGCN
RGCN (ours)
Additive attention
C-wirgat
wirgat
C-argat
argat
Multiplicative attention
C-wirgat
wirgat
C-argat
argat
(a) Transductive
Model Tox21
Multitask
Bypass
Weave
RGCN
RGCN (ours)
Additive attention
C-wirgat
wirgat
C-argat
argat
Multiplicative attention
C-wirgat
wirgat
C-argat
argat
(b) Inductive
Table 2:

(a) Entity classification accuracy (mean and standard deviation over 10 seeds) for FEAT (Paulheim and Fürnkranz, 2012), WL (Shervashidze et al., 2011; de Vries and de Rooij, 2015), RDF2Vec (Ristoski and Paulheim, 2016) and RGCN (Schlichtkrull et al., 2018), and (mean and standard deviation over 200 seeds) for our implementation of RGCN, as well as additive and multiplicative attention for (C-)WIRGAT and (C-)ARGAT (this work). Test performance is reported on the splits provided in Ristoski and Paulheim (2016). (b) Graph classification mean ROC-AUC across all 12 tasks (mean and standard deviation over 3 splits) for Multitask (Ramsundar et al., 2015), Bypass (Wu et al., 2018), Weave (Kearnes et al., 2016), RGCN (Altae-Tran et al., 2016), and (mean and standard deviation over 3 splits, 2 seeds per split) for our implementation of RGCN and additive and multiplicative attention for (C-)WIRGAT and (C-)ARGAT (this work). Test performance is reported on the splits provided in Wu et al. (2018). Best performance in class is shown in bold, and best performance overall is underlined. For completeness, we present the training and validation mean ROC-AUC alongside the test ROC-AUC in Appendix A. For a graphical representation of these results, see Figure 4 in Appendix C.

3.3.1 Benchmarks and additional analyses

Model means and standard deviations are presented in Table 2. To provide a picture of characteristic model behaviour, the CDFs for the hyperparameter sweeps are presented in Figure 5 of Appendix D. To draw meaningful conclusions, we compare against our own implementation of RGCN rather than the results reported in Schlichtkrull et al. (2018); Wu et al. (2018).

We will occasionally employ a one-sided hypothesis test in order to make concrete statements about model performance. The details and complete results of this test are presented in Appendix E. When we refer to significant results, this corresponds to a test statistic supporting our hypothesis at the chosen p-value threshold.

3.3.2 Transductive learning

In Table 2(a) we evaluate RGAT on MUTAG and AIFB. With additive attention, WIRGAT outperforms ARGAT, consistent with Schlichtkrull et al. (2018). Interestingly, when employing multiplicative attention, the converse appears true. For node classification tasks on RDF data, this indicates that the importance of a particular relation type does not vary much (if at all) across the graph, unless one employs a multiplicative comparison between node representations (or potentially other comparisons beyond additive or constant, i.e. RGCN).

AIFB

On AIFB, the best to worst performing models are: 1) additive WIRGAT, 2) multiplicative ARGAT, 3) RGCN, 4) additive ARGAT, and 5) multiplicative WIRGAT, with each comparison being significant.

When comparing against their constant attention counterparts, the significant differences observed were for additive and multiplicative ARGAT, where attention gives relative mean performance improvements of 1.03% and 0.31% respectively, and multiplicative WIRGAT, where attention gives a relative mean performance drop of 0.84%.

Although we present a state-of-the-art result on AIFB with additive WIRGAT, since its performance with and without attention is not significantly different, it is unlikely that this is due to the attention mechanism itself, at least at inference time. Over the hyperparameter space, additive WIRGAT and RGCN are comparable in performance (see Figure 5(a) in Appendix D), leading us to conclude that the result is more likely attributable to finding a better hyperparameter point for additive WIRGAT during the search.

MUTAG

On MUTAG, the best to worst performing models are: 1) RGCN, 2) multiplicative ARGAT, 3) additive WIRGAT tied with multiplicative WIRGAT, and 4) additive ARGAT, with each comparison being significant.

When comparing against their constant attention counterparts, the significant differences observed were for additive WIRGAT and ARGAT, where attention gives relative mean performance improvements of 0.66% and 2.90% respectively, and multiplicative ARGAT, where attention gives a relative mean performance drop of 1.63%.

We note that RGCN consistently outperforms RGAT on MUTAG, contrary to what might be expected (Schlichtkrull et al., 2018). The result is surprising given that RGCN lies within the parameter space of RGAT (where the attention kernel is zero), a configuration we check through evaluating C-WIRGAT. In our experiments we have observed that both RGCN and RGAT can memorise the MUTAG training set without difficulty (this is not the case for AIFB). The performance gap between RGCN and RGAT could then be explained by the following:

  • During training, the RGAT layer uses its attention mechanism to solve the learning objective. Once the objective is solved, the model is not encouraged by the loss function to seek a point in the parameter space that would also behave well when attention is set to a normalising constant within neighbourhoods (i.e. the parameter space point that would be found by RGCN).

  • The RDF tasks are transductive, meaning that a basis-dependent spectral approach is sufficient to solve them. As RGCN already memorises the MUTAG training set, a model more complex than RGCN (measured in terms of MDL, for example), such as RGAT, that can also memorise the training set is unlikely to generalise as well, although this is a hotly debated topic; see e.g. Zhang et al. (2016).

We employed a suite of regularisation techniques to encourage RGAT to generalise on MUTAG, including L2-norm penalties, dropout in multiple places, batch normalisation, parameter reduction and early stopping; however, none of the evaluated heavily regularised configurations of RGAT generalised well on MUTAG.

Our final observation is that the attention mechanism presented in Section 2.1 relies on node features. The node features for the above tasks are learned from scratch (the input feature matrix is a one-hot node index) as part of the task. It is possible that, in this semi-supervised setup, there is insufficient signal in the data to learn both the input node embeddings and a meaningful attention mechanism to act upon them.

3.3.3 Inductive learning

In Table 2(b) we evaluate RGAT on Tox21. The number of samples is lower for these evaluations than for the transductive tasks, and so fewer model comparisons will be accompanied by reasonable significance, although there are still some conclusions we can draw.

Through a thorough hyperparameter search, and by incorporating various regularisation techniques, we obtained a relative mean performance improvement of 0.72% for RGCN compared to the result reported in Wu et al. (2018), providing a much stronger baseline.

Both additive attention models match the performance of RGCN, whereas multiplicative WIRGAT and ARGAT marginally outperform RGCN, although neither difference is significant.

When comparing against their constant attention counterparts, the significant differences observed were for multiplicative WIRGAT and ARGAT, where attention gives relative mean performance improvements of 3.33% and 4.36% respectively. We do not observe any significant gains from additive attention when compared to its constant counterparts.

4 Conclusion

We have investigated a class of models we call Relational Graph Attention Networks (RGAT). These models act upon graph structures, inducing a masked self-attention that takes account of local relational structure as well as node features. This allows both nodes and their properties under specific relations to be dynamically assigned an importance for different nodes in the graph, and opens up graph attention mechanisms to a wider variety of problems.

We evaluated two specific attention mechanisms, WIRGAT and ARGAT, under both an additive and a multiplicative logit construction, and compared them to their equivalently evaluated spectral counterpart, RGCN.

We find that RGAT performs competitively or poorly against established baselines, and this behaviour appears strongly task-dependent. Specifically, relational inductive tasks such as graph classification benefit from multiplicative ARGAT, whereas transductive relational tasks, such as knowledge base completion, at least in the absence of node features, are better tackled using spectral methods like RGCN or other graph feature extraction methods like WL graph kernels.

In general we have found that WIRGAT should be paired with an additive logit mechanism, and fares marginally better than ARGAT on transductive tasks, whereas ARGAT should be paired with a multiplicative logit mechanism, and fares marginally better on inductive tasks.

We have found no cases where choosing any variation of RGAT is guaranteed to significantly outperform RGCN, although we have found that in cases where RGCN can memorise the training set, we are confident that RGAT will not perform as well as RGCN. Consequently, we suggest that before attempting to train RGAT, a good first test is to inspect the training set performance of RGCN.

Through our thorough evaluation and presentation of the behaviours and limitations of these models, insights can be derived that will enable the discovery of more powerful model architectures that act upon relational structures. Observing that model variance on all of the tasks presented here is high, any future work developing and expanding these methods should choose larger, more challenging datasets. In addition, a comparison between the generalisation of spectral methods, like those presented here, and generalisations of RNNs, like Gated Graph Sequence Neural Networks, is a necessary ingredient for determining the most promising future direction for these models.

5 Acknowledgements

We thank Ozan Oktay for many fruitful discussions during the early stages of this work, and Jeremie Vallee for assistance with the experimental setup. We also thank April Shen, Kostis Gourgoulias and Kristian Boda, whose comments greatly improved the manuscript, as well as Claire Woodcock for support.

References

  • Allamanis et al. (2017) Allamanis, M., M. Brockschmidt, and M. Khademi
    2017.
    Learning to represent programs with graphs. CoRR, abs/1711.00740.
  • Altae-Tran et al. (2016) Altae-Tran, H., B. Ramsundar, A. S. Pappu, and V. Pande
    2016.
    Low Data Drug Discovery with One-shot Learning. Pp.  1–20.
  • Atwood and Towsley (2016) Atwood, J. and D. Towsley
    2016.
    Diffusion-Convolutional Neural Networks. (NIPS).
  • Bergstra et al. (2013) Bergstra, J., D. Yamins, and D. Cox
    2013.
    Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., volume 28 of Proceedings of Machine Learning Research, Pp. 115–123, Atlanta, Georgia, USA. PMLR.
  • Bronstein et al. (2017) Bronstein, M. M., J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst
    2017.
    Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42.
  • Bruna et al. (2013) Bruna, J., W. Zaremba, A. Szlam, and Y. LeCun
    2013.
    Spectral Networks and Locally Connected Networks on Graphs. Pp.  1–14.
  • de Vries and de Rooij (2015) de Vries, G. K. D. and S. de Rooij
    2015.
    Substructure counting graph kernels for machine learning from rdf data. Web Semant., 35(P2):71–84.
  • Defferrard et al. (2016) Defferrard, M., X. Bresson, and P. Vandergheynst
    2016.
    Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. (Nips).
  • Donahue et al. (2017) Donahue, J., L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell
    2017.
    Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):677–691.
  • Duvenaud et al. (2015) Duvenaud, D., D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams
    2015.
    Convolutional Networks on Graphs for Learning Molecular Fingerprints.
  • Frasconi et al. (1997) Frasconi, P., V. D. S. Marta, M. Gori, V. Roma, and A. Sperduti
    1997.
    On the efficient classification of data structures by neural networks.
  • Gavish et al. (2010) Gavish, M., B. Nadler, R. R. Coifman, and N. Haven
    2010.
    Multiscale Wavelets on Trees, Graphs and High Dimensional Data: Theory and Applications to Semi-Supervised Learning. ICML, Pp. 367–374.
  • Gong and Cheng (2018) Gong, L. and Q. Cheng
    2018.
    Adaptive Edge Features Guided Graph Attention Networks.
  • Gori et al. (2005) Gori, M., M. Maggini, and L. Sarti
    2005.
    Exact and approximate graph matching using random walks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1100–1111.
  • Henaff et al. (2015) Henaff, M., J. Bruna, and Y. LeCun
    2015.
    Deep Convolutional Networks on Graph-Structured Data. Pp.  1–10.
  • Karpathy et al. (2014) Karpathy, A., G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li
    2014.
    Large-scale video classification with convolutional neural networks. Proc. IEEE CVPR.
  • Kearnes et al. (2016) Kearnes, S., K. McCloskey, M. Berndl, V. Pande, and P. Riley
    2016.
    Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided. Mol. Des., 30(8):595–608.
  • Kingma and Ba (2014) Kingma, D. P. and J. Ba
    2014.
    Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Kipf and Welling (2016) Kipf, T. N. and M. Welling
    2016.
    Semi-Supervised Classification with Graph Convolutional Networks.
  • Lee et al. (2018) Lee, J. B., R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh
    2018.
    Attention Models in Graphs: A Survey. 0(1).
  • Mann and Whitney (1947) Mann, H. B. and D. R. Whitney
    1947.
    On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist., 18(1):50–60.
  • Monti et al. (2017) Monti, F., D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein
    2017.
    Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, volume 2017-January, Pp. 5425–5434.
  • Monti et al. (2018) Monti, F., O. Shchur, A. Bojchevski, O. Litany, S. Günnemann, and M. M. Bronstein
    2018.
    Dual-Primal Graph Convolutional Networks. Pp.  1–11.
  • Paulheim and Fürnkranz (2012) Paulheim, H. and J. Fürnkranz
    2012.
    Unsupervised generation of data mining features from linked open data. In Proc. 2nd Int. Conf. Web Intell. Min. Semant. - WIMS ’12, P.  1, New York, New York, USA. ACM Press.
  • Ramsundar et al. (2015) Ramsundar, B., S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande
    2015.
    Massively Multitask Networks for Drug Discovery. (Icml).
  • Ristoski and Paulheim (2016) Ristoski, P. and H. Paulheim
    2016.
    RDF2Vec: RDF graph embeddings for data mining. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 9981 LNCS, Pp. 498–514.
  • Rustamov and Guibas (2013) Rustamov, R. and L. Guibas
    2013.
    Wavelets on graphs via deep learning. Nips, Pp.  1–9.
  • Scarselli et al. (2009) Scarselli, F., M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini
    2009.
    The graph neural network model. Trans. Neur. Netw., 20(1):61–80.
  • Schlichtkrull et al. (2018) Schlichtkrull, M., T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling
    2018.
    Modeling Relational Data with Graph Convolutional Networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 10843 LNCS, Pp.  593–607.
  • Shervashidze et al. (2011) Shervashidze, N., P. Schweitzer, E. Jan van Leeuwen, K. Mehlhorn, and K. Borgwardt
    2011.
    Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res., 12:2539–2561.
  • Sperduti and Starita (1997) Sperduti, A. and A. Starita
    1997.
    Supervised neural networks for the classification of structures. IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council, 8:714–35.
  • Szlam et al. (2005) Szlam, A. D., M. Maggioni, R. R. Coifman, and J. C. Bremer Jr.
    2005.
    Diffusion-driven multiscale analysis on manifolds and graphs: top-down and bottom-up constructions. P.  59141D.
  • Vaswani et al. (2017) Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin
    2017.
    Attention is all you need. CoRR, abs/1706.03762.
  • Veličković et al. (2017) Veličković, P., G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio
    2017.
    Graph Attention Networks.
  • Wang et al. (2018) Wang, Y., Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon
    2018.
    Dynamic Graph CNN for Learning on Point Clouds. Technical report.
  • Wu et al. (2018) Wu, Z., B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande
    2018.
    MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513–530.
  • Zhang et al. (2016) Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals
    2016.
    Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530.
  • Zhang et al. (2018) Zhang, J., X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung
    2018.
    GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs.

Appendix A Tox21 Results

For completeness, we present the training, validation and test set performance of our models in addition to those in Wu et al. (2018) in Table 3.

Model Training Validation Test
Multitask
Bypass
Weave
RGCN
RGCN (ours)
Additive attention
C-wirgat
wirgat
C-argat
argat
Multiplicative attention
C-wirgat
wirgat
C-argat
argat
Table 3: Graph classification mean ROC-AUC across all 12 tasks (mean and standard deviation over 3 splits) for Multitask (Ramsundar et al., 2015), Bypass (Wu et al., 2018), Weave (Kearnes et al., 2016), RGCN (Altae-Tran et al., 2016), our implementation of RGCN, and additive and multiplicative attention versions of WIRGAT and ARGAT (this work). Training, validation and test performance is reported on the splits provided in Wu et al. (2018). Best performance in class is shown in bold, and best performance overall is underlined.

Appendix B Hyperparameters

We perform hyperparameter optimisation using Hyperopt (Bergstra et al., 2013), with priors for the transductive tasks specified in Table 4 and priors for the inductive task specified in Table 5. In all experiments we use the Adam optimiser (Kingma and Ba, 2014).

Hyperparameter Prior
Graph kernel units
Heads
Feature dropout rate
Edge dropout
basis size
Graph layer 1 L2 coef
Graph layer 2 L2 coef
basis size
Graph layer 1 L2 coef
Graph layer 2 L2 coef
Learning rate
Use bias
Use batch normalisation
Table 4: Priors on the hyperparameter search space for the transductive tasks. When multihead attention is used, the number of units per head is appropriately reduced in order to keep the total number of output units of an RGAT layer independent of the number of heads.
Hyperparameter Prior
Graph kernel units
Dense units
Heads
Feature dropout
Edge dropout
L2 coef (1)
L2 coef (2)
L2 coef (1)
L2 coef (2)
Learning rate
Use bias
Use batch normalisation
Table 5: Priors on the hyperparameter search space for the inductive task. The batch size was held at 64, and no basis decomposition is used. When multihead attention is used, the number of units per head is appropriately reduced in order to keep the total number of output units of an RGAT layer independent of the number of heads.

Appendix C Charts

To aid interpretability of the results presented in Table 2 we present a chart representation in Figure 4.

(a) AIFB
(b) MUTAG
(c) TOX21
Figure 4: (a) and (b): Blue: baseline entity classification accuracy (mean and standard deviation over 10 seeds) for FEAT (Paulheim and Fürnkranz, 2012), WL (Shervashidze et al., 2011; de Vries and de Rooij, 2015), RDF2Vec (Ristoski and Paulheim, 2016) and RGCN (Schlichtkrull et al., 2018), and (mean and standard deviation over 200 runs) for our implementation of RGCN. Yellow: entity classification accuracy (mean and standard deviation over 200 seeds) for additive attention (this work). Red: entity classification accuracy (mean and standard deviation over 200 seeds) for multiplicative attention (this work). Test performance is reported on the splits provided in Ristoski and Paulheim (2016). (c): Blue: baseline graph classification mean ROC-AUC across all 12 tasks (mean and standard deviation over 3 splits) for Multitask (Ramsundar et al., 2015), Bypass (Wu et al., 2018), Weave (Kearnes et al., 2016), RGCN (Altae-Tran et al., 2016), and (mean and standard deviation over 3 splits, 2 seeds per split) for our implementation of RGCN. Yellow: additive attention graph classification mean ROC-AUC (mean and standard deviation over 200 seeds) across all 12 tasks (this work). Red: multiplicative attention graph classification mean ROC-AUC (mean and standard deviation over 200 seeds) across all 12 tasks (this work). All raw values are given in Table 2.

Appendix D Cumulative distribution functions

To aid further insight into our results, we present the CDF for each model on each task in Figure 5. In this context, we treat the performance metric of interest during the hyperparameter search as the empirical distribution of some random variable $X$. We then define its CDF in the standard way:

$F_X(x) = P(X \le x),$   (16)

where $F_X(x)$ is the probability that $X$ takes on a value less than or equal to $x$. The CDF allows one to gauge whether any given architecture typically performs better than another across the whole hyperparameter space, rather than comparing only the tuned hyperparameters, which in some cases may be outliers in terms of generic behaviour for that architecture.
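For reference, a minimal sketch of how such an empirical CDF can be computed from the sweep results (illustrative only):

```python
import numpy as np

def empirical_cdf(samples):
    """Return sorted metric values x and F_X(x) = P(X <= x) estimated from a hyperparameter sweep."""
    x = np.sort(np.asarray(samples, dtype=float))
    F = np.arange(1, x.size + 1) / x.size
    return x, F
```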

(a) AIFB
(b) MUTAG
(c) TOX21
Figure 5: CDFs for all models on a) AIFB, b) MUTAG and c) TOX21. Green lines correspond to our implementation of RGCN, blue lines correspond to ARGAT, and orange lines correspond to WIRGAT. Solid lines correspond to additive attention (and RGCN), whereas dashed lines correspond to multiplicative attention. A lower CDF value is better in the sense that a greater proportion of models achieve a higher value of that metric.
AIFB

Additive and multiplicative ARGAT perform poorly over most of the hyperparameter space, whereas RGCN and multiplicative WIRGAT perform comparably across the entire hyperparameter space.

MUTAG

Interestingly, the models that have a greater amount of hyperparameter space covering poor performance (i.e. RGCN, multiplicative and additive WIRGAT) are also the models that have a greater amount of hyperparameter space covering good performance. In other words, on MUTAG, the ARGAT prior resulted in a model whose test set performance was relatively insensitive to hyperparameter choice when compared against the other candidates. Given that the ARGAT model was the most flexible of the models evaluated, and that it was able to memorise the training set, this suggests that the task contained insufficient information for the model to learn its attention mechanism. Given that WIRGAT was able to at least partially learn its attention mechanism, this suggests that WIRGAT is less data hungry than ARGAT.

Tox21

The multiplicative attention models fare poorly on the majority of the hyperparameter space compared to the other models. There is, however, a slice of the hyperparameter space where the multiplicative attention models outperform the other models, indicating that although they are difficult to train, it may be worth spending time hyperoptimising them if the best performing model on a relational inductive task is needed. The additive attention models and RGCN perform comparably across the entirety of the hyperparameter space and generally perform better than the multiplicative methods, except in the very small region of hyperparameter space mentioned above.

Appendix E Significance testing

In order to determine whether any of our model comparisons are significant, we employ the one-sided Mann-Whitney test (Mann and Whitney, 1947), as we are interested in the direction of movement (i.e. performance) and do not want to make any parametric assumptions about model response. For two populations $X$ and $Y$:

  • The null hypothesis $H_0$ is that the two populations are equal, and

  • The alternative hypothesis $H_1$ is that the probability of an observation from population $X$ exceeding an observation from population $Y$ is larger than the probability of an observation from $Y$ exceeding an observation from $X$; i.e., $P(X > Y) > P(Y > X)$.

We treat the empirical distribution of Model A as samples from population $X$ and the empirical distribution of Model B as samples from population $Y$. This gives us a window into which of a pair of models is the better model for a given task. Results on AIFB, MUTAG and TOX21 are given in Figure 6, Figure 7 and Figure 8 respectively.
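For reference, a minimal sketch of this test using SciPy's mannwhitneyu with a one-sided alternative (the score lists are placeholders, not our results):

```python
from scipy.stats import mannwhitneyu

# Placeholder per-seed test metrics for two models (illustrative values only).
model_a_scores = [0.95, 0.96, 0.94, 0.97, 0.95]
model_b_scores = [0.93, 0.94, 0.92, 0.95, 0.93]

# One-sided test: the alternative is that Model A's scores tend to exceed Model B's.
statistic, p_value = mannwhitneyu(model_a_scores, model_b_scores, alternative='greater')
print(statistic, p_value)
```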

Figure 6: The p-values for the Mann-Whitney test with the alternative hypothesis of Model A outperforming Model B, on AIFB.
Figure 7: The p-values for the Mann-Whitney test with the alternative hypothesis of Model A outperforming Model B, on MUTAG.
Figure 8: The p-values for the Mann-Whitney test with the alternative hypothesis of Model A outperforming Model B, on TOX21.