rgat
A TensorFlow implementation of Relational Graph Attention Networks, paper: https://arxiv.org/abs/1904.05811
We investigate Relational Graph Attention Networks, a class of models that extends non-relational graph attention mechanisms to incorporate relational information, opening up these methods to a wider variety of problems. A thorough evaluation of these models is performed, and comparisons are made against established benchmarks. To provide a meaningful comparison, we retrain Relational Graph Convolutional Networks, the spectral counterpart of Relational Graph Attention Networks, and evaluate them under the same conditions. We find that Relational Graph Attention Networks perform worse than anticipated, although some configurations are marginally beneficial for modelling molecular properties. We provide insights as to why this may be, and suggest both modifications to evaluation strategies and directions to investigate in future work.
CNNs successfully solve a variety of tasks in Euclidean, grid-like domains, such as image captioning (Donahue et al., 2017) and classifying videos (Karpathy et al., 2014). CNNs are successful because they assume the data is locally stationary and compositional (Defferrard et al., 2016; Henaff et al., 2015; Bruna et al., 2013). However, data often occurs in the form of graphs or manifolds, which are classic examples of non-Euclidean domains. Specific instances include knowledge bases, molecules, and point clouds captured by 3D data acquisition devices (Wang et al., 2018). The generalisation of neural networks to non-Euclidean domains is termed geometric deep learning (GDL), and may be roughly divided into spectral, spatial and hybrid approaches (Bronstein et al., 2017).
Spectral approaches (Defferrard et al., 2016), most notably GCNs (Kipf and Welling, 2016), are limited by their basis dependence: a filter learned with respect to a basis on one domain is not guaranteed to behave similarly when applied to another basis and domain. Spatial approaches are limited by the absence of shift invariance and the lack of a coordinate system (Duvenaud et al., 2015; Atwood and Towsley, 2016; Monti et al., 2017). Hybrid approaches combine spectral and spatial approaches, trading their advantages and deficiencies against each other (Bronstein et al., 2017; Rustamov and Guibas, 2013; Szlam et al., 2005; Gavish et al., 2010).
A recent line of work, beginning with GAT, applies attention mechanisms to graphs and does not share these limitations (Veličković et al., 2017; Gong and Cheng, 2018; Zhang et al., 2018; Monti et al., 2018; Lee et al., 2018).
An alternative direction has been to generalise RNNs from sequential message passing on one-dimensional signals to message passing on graphs (Sperduti and Starita, 1997; Frasconi et al., 1997; Gori et al., 2005). Incorporating gating mechanisms led to the development of GGNNs (Scarselli et al., 2009; Allamanis et al., 2017).[1]

[1] We note that GGNNs support relation types. Evaluating these models on the tasks presented here is necessary to acquire a better understanding of neural models of relational data.
RGCNs have been proposed as an extension of GCNs to the domain of relational graphs (Schlichtkrull et al., 2018). This model has achieved impressive performance on node classification and link prediction tasks; however, its mechanism still resides within spectral methods and shares their deficiencies. The focus of this work is to investigate generalisations of RGCN away from its spectral origins.
We take RGCN as a starting point, and investigate a class of models we term Relational Graph Attention Networks (RGAT), extending attention mechanisms to the relational graph domain. We consider two variants, WIRGAT and ARGAT, each with either additive or multiplicative attention. We perform an extensive hyperparameter search, and evaluate these models on challenging transductive node classification and inductive graph classification tasks. These models are compared against established benchmarks, as well as a re-tuned RGCN model.
We show that RGAT performs worse than expected, although some configurations produce marginal benefits on inductive graph classification tasks. To aid further investigation in this direction, we present the full CDFs for the hyperparameter searches in Appendix D, and statistical hypothesis tests in Appendix E. We also provide a vectorised, sparse, batched implementation of RGAT and RGCN in TensorFlow, compatible with eager execution mode, to open up research into these models to a wider audience.[2]

[2] https://github.com/Babylonpartners/rgat

We follow the construction of the GAT layer in Veličković et al. (2017), extending it to the relational setting using ideas from Schlichtkrull et al. (2018).
The input to the layer is a graph with $R$ relation types and $N$ nodes. The $i$th node is represented by a feature vector $h_i \in \mathbb{R}^F$, and the features of all nodes are summarised in the feature matrix $H \in \mathbb{R}^{N \times F}$. The output of the layer is the transformed feature matrix $H' \in \mathbb{R}^{N \times F'}$, where $h'_i \in \mathbb{R}^{F'}$ is the transformed feature vector of the $i$th node.

Different relations convey distinct pieces of information. The update rule of Schlichtkrull et al. (2018) made this manifest by assigning each node a distinct intermediate representation $g_i^{(r)}$ under each relation $r$
$$G^{(r)} = H\, W^{(r)} \qquad (1)$$
where $G^{(r)} \in \mathbb{R}^{N \times F'}$ is the intermediate representation feature matrix under relation $r$, and $W^{(r)} \in \mathbb{R}^{F \times F'}$ are the learnable parameters of a shared linear transformation.
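As a concrete illustration, here is a minimal dense sketch of Equation 1 in TensorFlow, using toy shapes. The accompanying repository uses a sparse, batched implementation, so this is purely illustrative:

```python
import tensorflow as tf

N, F, Fp, R = 5, 8, 16, 3               # nodes, in/out feature dims, relations

H = tf.random.normal([N, F])            # node feature matrix H
W = tf.random.normal([R, F, Fp])        # one kernel W^(r) per relation

# G^(r) = H W^(r): intermediate representations under each relation
G = tf.einsum('nf,rfo->rno', H, W)      # shape [R, N, Fp]
```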
Following Veličković et al. (2017); Zhang et al. (2018), we assume the attention coefficient between two nodes is based only on the features of those nodes, up to a neighbourhood-level normalisation. To keep computational complexity linear in $R$, we assume that, given the linear transformations $W^{(r)}$, the logits $E^{(r)}_{i,j}$ of each relation are independent

$$E^{(r)}_{i,j} = a\big(g^{(r)}_i, g^{(r)}_j\big) \qquad (2)$$
and indicate the importance of node $j$'s intermediate representation to that of node $i$ under relation $r$. The attention is masked so that, for node $i$, coefficients $\alpha^{(r)}_{i,j}$ exist only for $j \in \mathcal{N}^{(r)}_i$, where $\mathcal{N}^{(r)}_i$ denotes the set of neighbour indices of node $i$ under relation $r$.
The logits are composed from queries and keys, and specify how the values, i.e. the intermediate representations $g^{(r)}_i$, will combine to produce the updated node representations (Vaswani et al., 2017). A separate query kernel $Q^{(r)} \in \mathbb{R}^{F' \times D}$ and key kernel $K^{(r)} \in \mathbb{R}^{F' \times D}$ project the intermediate representations $g^{(r)}_i$ into query and key representations of dimensionality $D$

$$q^{(r)}_i = g^{(r)}_i \cdot Q^{(r)}, \qquad k^{(r)}_i = g^{(r)}_i \cdot K^{(r)} \qquad (3)$$
For convenience, the query and key kernels are combined to form the attention kernels $a^{(r)}$. These query and key representations are the building blocks of the two specific realisations of $a$ in Equation 2 that we now consider.
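A dense sketch of the projections in Equation 3, again with illustrative toy shapes rather than the repository's sparse implementation:

```python
import tensorflow as tf

R, N, Fp, D = 3, 5, 16, 4               # D is the query/key dimensionality
G = tf.random.normal([R, N, Fp])        # intermediate representations g_i^(r)
Q = tf.random.normal([R, Fp, D])        # query kernels Q^(r)
K = tf.random.normal([R, Fp, D])        # key kernels K^(r)

q = tf.einsum('rnf,rfd->rnd', G, Q)     # queries q_i^(r), shape [R, N, D]
k = tf.einsum('rnf,rfd->rnd', G, K)     # keys    k_i^(r), shape [R, N, D]
```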
The first realisation of $a$ we consider is the relational modification of the logit mechanism of Veličković et al. (2017)

$$E^{(r)}_{i,j} = \mathrm{LeakyReLU}\big(q^{(r)}_i + k^{(r)}_j\big) \qquad (4)$$
where the query and key dimensionality are both $D = 1$, and $q^{(r)}_i$ and $k^{(r)}_j$ are scalar flattenings of their one-dimensional vector counterparts. We refer to any instance of RGAT using logits of the form in Equation 4 as additive RGAT.
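A sketch of the additive logits: with $D = 1$ the queries and keys are scalars, combined pairwise by broadcasting before the LeakyReLU:

```python
import tensorflow as tf

R, N = 3, 5
q = tf.random.normal([R, N])            # scalar queries (D = 1, flattened)
k = tf.random.normal([R, N])            # scalar keys

# Equation 4: E^(r)_{ij} = LeakyReLU(q_i^(r) + k_j^(r)), shape [R, N, N]
E_add = tf.nn.leaky_relu(q[:, :, None] + k[:, None, :])
```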
The second realisation we consider is the multiplicative mechanism of Vaswani et al. (2017); Zhang et al. (2018)[3]

[3] The form of our mechanism is not precisely that of Zhang et al. (2018), as they also consider residual concatenation and a gating mechanism applied across the heads of the attention mechanism.

$$E^{(r)}_{i,j} = q^{(r)}_i \cdot k^{(r)}_j \qquad (5)$$
where the query and key dimensionality $D$ can be any positive integer. We refer to any instance of RGAT using logits of the form in Equation 5 as multiplicative RGAT.
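The multiplicative logits are a dot product of the query and key vectors, sketched here with the same toy shapes as above:

```python
import tensorflow as tf

R, N, D = 3, 5, 4
q = tf.random.normal([R, N, D])
k = tf.random.normal([R, N, D])

# Equation 5: E^(r)_{ij} = q_i^(r) . k_j^(r), shape [R, N, N]
E_mul = tf.einsum('rid,rjd->rij', q, k)
```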
It should be noted that there are many types of attention mechanisms beyond vanilla additive and multiplicative ones. These include mechanisms leveraging the structure of the dual graph (Monti et al., 2018), as well as learned edge features (Gong and Cheng, 2018).
The attention coefficients should be comparable across nodes. This can be achieved by applying a softmax appropriately to the logits $E^{(r)}_{i,j}$. We investigate two candidates, each encoding a different prior belief about the importance of different relations.
The simplest way to take the softmax over the logits of Equation 4 or Equation 5 is to do so independently for each relation
$$\alpha^{(r)}_{i,j} = \mathrm{softmax}_j\big(E^{(r)}_{i,j}\big) = \frac{\exp\big(E^{(r)}_{i,j}\big)}{\sum_{k \in \mathcal{N}^{(r)}_i} \exp\big(E^{(r)}_{i,k}\big)} \qquad (6)$$
We call the attention in Equation 6 within-relation graph attention (WIRGAT), and it is shown in Figure 1. This mechanism encodes the prior that relation importance is a purely global property of the graph, by implementing an independent probability distribution over the nodes in the neighbourhood of node $i$ for each relation $r$. Explicitly, for any node $i$ and relation $r$, nodes $j, k \in \mathcal{N}^{(r)}_i$ yield competing attention coefficients $\alpha^{(r)}_{i,j}$ and $\alpha^{(r)}_{i,k}$, with sizes depending on their corresponding representations $g^{(r)}_j$ and $g^{(r)}_k$. There is no competition between the attention coefficients $\alpha^{(r)}_{i,j}$ and $\alpha^{(r')}_{i,k}$ for any nodes $j$ and $k$ when $r \neq r'$, irrespective of node representations.

An alternative way to take the softmax over the logits of Equation 4 or Equation 5 is across node neighbourhoods irrespective of relation
$$\alpha^{(r)}_{i,j} = \mathrm{softmax}_{j,r}\big(E^{(r)}_{i,j}\big) = \frac{\exp\big(E^{(r)}_{i,j}\big)}{\sum_{r'} \sum_{k \in \mathcal{N}^{(r')}_i} \exp\big(E^{(r')}_{i,k}\big)} \qquad (7)$$
We call the attention in Equation 7 across-relation graph attention (ARGAT), and it is shown in Figure 2. This mechanism encodes the prior that relation importance is a local property of the graph, by implementing a single probability distribution over the different representations $g^{(r)}_j$ for nodes $j$ in the neighbourhood of node $i$. Explicitly, for any node $i$ and all relations $r$ and $r'$, nodes $j \in \mathcal{N}^{(r)}_i$ and $k \in \mathcal{N}^{(r')}_i$ yield competing attention coefficients $\alpha^{(r)}_{i,j}$ and $\alpha^{(r')}_{i,k}$, with sizes depending on their corresponding representations $g^{(r)}_j$ and $g^{(r')}_k$.
For comparison, the coefficients of RGCN are given by $\alpha^{(r)}_{i,j} = 1 / |\mathcal{N}^{(r)}_i|$. This encodes the prior that the intermediate representations of the neighbours of node $i$ under relation $r$ are equally important.
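The two normalisations differ only in which axes the softmax runs over. A dense sketch contrasting them, where `A` is an assumed binary adjacency of shape [R, N, N] and masked entries receive large negative logits:

```python
import tensorflow as tf

R, N = 3, 5
E = tf.random.normal([R, N, N])                       # logits E^(r)_{ij}
A = tf.cast(tf.random.uniform([R, N, N]) < 0.5, tf.float32)
E = tf.where(A > 0, E, tf.fill([R, N, N], -1e9))      # mask non-neighbours

# WIRGAT (Equation 6): independent softmax over neighbours j per relation r;
# each row (r, i) sums to 1 on its own.
alpha_wi = tf.nn.softmax(E, axis=-1)

# ARGAT (Equation 7): one softmax per node i across all (relation, neighbour)
# pairs; the R*N entries for each i jointly sum to 1.
E_i = tf.reshape(tf.transpose(E, [1, 0, 2]), [N, R * N])
alpha_ar = tf.transpose(
    tf.reshape(tf.nn.softmax(E_i, axis=-1), [N, R, N]), [1, 0, 2])
```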
Combining the attention mechanism of either Equation 6 or Equation 7 with the neighbourhood aggregation step of Schlichtkrull et al. (2018) gives

$$h'_i = \sigma\Big(\sum_{r=1}^{R} \sum_{j \in \mathcal{N}^{(r)}_i} \alpha^{(r)}_{i,j}\, g^{(r)}_j\Big) \qquad (8)$$
where $\sigma$ represents an optional nonlinearity. Similar to Vaswani et al. (2017); Veličković et al. (2017), we also find that using multiple heads in the attention mechanism can enhance performance

$$h'_i = \Big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{r=1}^{R} \sum_{j \in \mathcal{N}^{(r)}_i} \alpha^{(r,k)}_{i,j}\, g^{(r,k)}_j\Big) \qquad (9)$$
where $\Vert$ denotes vector concatenation, $\alpha^{(r,k)}_{i,j}$ are the normalised attention coefficients under relation $r$ computed by either WIRGAT or ARGAT for head $k$, and $g^{(r,k)}_j$ is the head-specific intermediate representation of node $j$ under relation $r$.
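A dense sketch of the single-head propagation rule of Equation 8; the multi-head rule of Equation 9 simply concatenates several independent copies of this computation along the feature axis. The WIRGAT-style coefficients here are placeholders:

```python
import tensorflow as tf

R, N, Fp = 3, 5, 16
alpha = tf.nn.softmax(tf.random.normal([R, N, N]), axis=-1)  # stand-in coefficients
G = tf.random.normal([R, N, Fp])                             # values g_j^(r)

# Equation 8: sum over relations r and neighbours j, then a nonlinearity
H_out = tf.nn.relu(tf.einsum('rij,rjf->if', alpha, G))       # h'_i, shape [N, Fp]
```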
It might be interesting to consider cases where there are a different number of heads for different relation types, as well as cases where a mixture of ARGAT and WIRGAT produce the attention coefficients; however, we leave this for future investigation and will not consider it further.
The number of parameters in the RGAT layer increases linearly with the number of relations $R$ and heads $K$, and can quickly lead to overparameterisation. In RGCN it was found that decomposing the kernels was beneficial for generalisation, although it comes at the cost of increased model bias (Schlichtkrull et al., 2018). We follow this approach, decomposing both the kernels $W^{(r)}$ as well as the kernels of the attention mechanism into basis matrices $V_b$ and basis vectors $v_b$

$$W^{(r)} = \sum_{b=1}^{B_W} c^{(r)}_b\, V_b, \qquad a^{(r)} = \sum_{b=1}^{B_a} d^{(r)}_b\, v_b \qquad (10)$$
where $c^{(r)}_b$ and $d^{(r)}_b$ are basis coefficients. We consider models using both full and decomposed $W^{(r)}$ and $a^{(r)}$.
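A sketch of the kernel decomposition in Equation 10: each relation kernel is a coefficient-weighted sum of shared basis matrices, reducing the per-relation parameter count:

```python
import tensorflow as tf

R, F, Fp, B = 3, 8, 16, 2
V = tf.Variable(tf.random.normal([B, F, Fp]))   # shared basis matrices V_b
c = tf.Variable(tf.random.normal([R, B]))       # per-relation coefficients c^(r)_b

# Reconstruct all R kernels from B << R shared bases, shape [R, F, Fp]
W = tf.einsum('rb,bfo->rfo', c, V)
```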
For the transductive task of semi-supervised node classification, we employ the two-layer RGAT architecture shown in Figure 2(a). We use a ReLU activation after the RGAT concatenation layer, and a node-wise softmax on the final layer to produce an estimate for the probability that the $i$th node is in class $c$

$$\hat{y}_{i,c} = \mathrm{softmax}_c\big(h^{(2)}_{i,c}\big) \qquad (11)$$
We then employ a masked cross-entropy loss to constrain the network updates to the subset of nodes whose labels are known

$$\mathcal{L} = -\sum_{i \in \mathcal{Y}} \sum_{c=1}^{C} y_{i,c}\, \ln \hat{y}_{i,c} \qquad (12)$$
where $y_{i,c}$ is the one-hot representation of the true label for node $i$, and $\mathcal{Y}$ is the set of labelled node indices.
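A minimal sketch of the masked loss in Equation 12; `label_mask` is an illustrative name, not taken from the repository:

```python
import tensorflow as tf

N, C = 5, 4
logits = tf.random.normal([N, C])                 # final-layer node logits
y = tf.one_hot([0, 2, 1, 3, 0], depth=C)          # one-hot labels
label_mask = tf.constant([1., 0., 1., 0., 1.])    # 1 where the label is known

per_node = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_sum(per_node * label_mask)       # unlabelled nodes contribute 0
```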
For inductive graph classification, we employ a two-layer RGAT followed by a graph gather and dense network architecture, shown in Figure 2(b). We use ReLU activations after each RGAT layer and the first dense layer. We use a tanh activation after the graph gather, which is a vector concatenation of the mean of the node representations with the feature-wise max of the node representations

$$\mathrm{GraphGather}(H) = \Big[\frac{1}{N} \sum_{i=1}^{N} h_i \;\Big\Vert\; \max_i h_i\Big] \qquad (13)$$
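A single-graph sketch of the gather operation in Equation 13 (the repository handles batches of graphs; this toy version operates on one node matrix):

```python
import tensorflow as tf

def graph_gather(nodes):
    """Concatenate the mean with the feature-wise max of node representations,
    then apply tanh. `nodes` has shape [N, Fp]; output has shape [2 * Fp]."""
    mean = tf.reduce_mean(nodes, axis=0)
    mx = tf.reduce_max(nodes, axis=0)
    return tf.tanh(tf.concat([mean, mx], axis=-1))

pooled = graph_gather(tf.random.normal([5, 16]))
```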
The final dense layer then produces logits of size $T \times C$, and we apply a task-wise softmax to its output to produce an estimate for the probability that the graph is in class $c$ for a given task $t$, analogous to Equation 11. A weighted cross-entropy loss is then used to form the learning objective

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{c=1}^{C} w_{t,c}\, y_{t,c}\, \ln \hat{y}_{t,c} \qquad (14)$$
where $w_{t,c}$ and $y_{t,c}$ are the weights and one-hot true labels for task $t$ and class $c$ respectively.
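A sketch of the weighted objective in Equation 14; the weights and labels below are random placeholders:

```python
import tensorflow as tf

T, C = 12, 2                                      # e.g. 12 binary Tox21 tasks
logits = tf.random.normal([T, C])
y = tf.one_hot(tf.random.uniform([T], maxval=C, dtype=tf.int32), depth=C)
w = tf.ones([T, C])                               # task/class weights w_{t,c}

log_probs = tf.nn.log_softmax(logits, axis=-1)
loss = -tf.reduce_sum(w * y * log_probs)          # Equation 14
```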
We evaluate the models on transductive and inductive tasks. Following the experimental setup of Schlichtkrull et al. (2018) for the transductive tasks, we evaluate our model on the RDF datasets AIFB and MUTAG. We also evaluate our model on an inductive task on the molecular dataset Tox21. Details of these datasets are given in Table 1. For further details on the transductive and inductive datasets, please see Ristoski and Paulheim (2016) and Wu et al. (2018) respectively.
| Datasets | AIFB | MUTAG | Tox21 |
| --- | --- | --- | --- |
| Task | Transductive | Transductive | Inductive |
| Nodes | 8,285 (1 graph) | 23,644 (1 graph) | 145,459 (8,014 graphs) |
| Edges | 29,043 | 74,227 | 151,095 |
| Relations | 45 | 23 | 4 |
| Labelled | 176 | 340 | 96,168 (12 per graph) |
| Classes | 4 | 2 | 12 (multi-label) |
| Train nodes | 112 | 218 | (6,411 graphs) |
| Validation nodes | 28 | 54 | (801 graphs) |
| Test nodes | 28 | 54 | (802 graphs) |
We consider as a baseline the recent state-of-the-art results from Schlichtkrull et al. (2018), obtained with a two-layer RGCN model with 16 hidden units and basis function decomposition. We also include the same challenging baselines of FEAT (Paulheim and Fürnkranz, 2012), WL (Shervashidze et al., 2011; de Vries and de Rooij, 2015) and RDF2Vec (Ristoski and Paulheim, 2016). In-depth details of these baselines are given by Ristoski and Paulheim (2016).
As baselines for Tox21, we compare against the most competitive methods reported in Wu et al. (2018). Specifically, we compare against deep multitask networks (Ramsundar et al., 2015), deep bypass multitask networks (Wu et al., 2018), Weave (Kearnes et al., 2016), and an RGCN model whose relational structure is determined by the degree of the node to be updated (Altae-Tran et al., 2016). Specifically, up to and including some maximum degree $d_{\max}$,
$$h'_i = \sigma\Big(W^{\mathrm{self}}_{d_i}\, h_i + \sum_{j \in \mathcal{N}_i} W_{d_i}\, h_j + b_{d_i}\Big), \qquad d_i = |\mathcal{N}_i| \qquad (15)$$
where $W^{\mathrm{self}}_d$ is a degree-specific linear transformation for self-connections, $W_d$ is a degree-specific linear transformation taking neighbours into their intermediate representations, and $b_d$ is a degree-specific bias. The update for any node whose degree exceeds $d_{\max}$ is assigned the update for the maximum degree $d_{\max}$.
For the transductive learning tasks, the architecture discussed in Section 2.2 was applied. Its hyperparameters were optimised for both AIFB and MUTAG on their respective training/validation sets defined in Ristoski and Paulheim (2016), using 5-fold cross validation. Using the found hyperparameters, we retrain on the full training set and report results on the test set across 200 seeds. We employ early stopping on the validation set during cross-validation to determine the number of epochs to run on the final training set. Hyperparameter optimisation details are given in Table 4 of Appendix B.

For the inductive learning tasks, the architecture discussed in Section 2.3 was applied. In order to optimise hyperparameters once, ensure no data leakage, and also provide benchmarks comparable to those presented in Wu et al. (2018), three benchmark splits were taken from the MolNet benchmarks,[4] and graphs belonging to any of the test sets were isolated. Using the remaining graphs, we performed a hyperparameter search using 2 folds of 10-fold cross validation. Using the found hyperparameters, we then retrained on the three benchmark splits provided with 2 seeds each, giving an unbiased estimate of model performance. We employ early stopping during both the cross-validation and final runs (the validation set of the inductive task is available for the final benchmark, in contrast to the transductive tasks) to determine the number of training epochs. Hyperparameter optimisation details are given in Table 5 of Appendix B.

[4] Retrieved from http://deepchem.io.s3-website-us-west-1.amazonaws.com/trained_models/Hyperparameter_MoleculeNetv3.tar.gz

In all experiments, we train with the attention mechanism turned on. At evaluation time, however, we report results both with and without the attention mechanism, to provide insight into whether the attention mechanism helps. ARGAT (WIRGAT) with the attention mechanism turned off is called C-ARGAT (C-WIRGAT).


Table 2: (a) Entity classification accuracy (mean and standard deviation over 10 seeds) for FEAT (Paulheim and Fürnkranz, 2012), WL (Shervashidze et al., 2011; de Vries and de Rooij, 2015), RDF2Vec (Ristoski and Paulheim, 2016) and RGCN (Schlichtkrull et al., 2018), and (mean and standard deviation over 200 seeds) for our implementation of RGCN, as well as additive and multiplicative attention for (C-)WIRGAT and (C-)ARGAT (this work). Test performance is reported on the splits provided in Ristoski and Paulheim (2016). (b) Graph classification mean ROC-AUC across all 12 tasks (mean and standard deviation over 3 splits) for Multitask (Ramsundar et al., 2015), Bypass (Wu et al., 2018), Weave (Kearnes et al., 2016), RGCN (Altae-Tran et al., 2016), and (mean and standard deviation over 3 splits, 2 seeds per split) our implementation of RGCN, as well as additive and multiplicative attention for (C-)WIRGAT and (C-)ARGAT (this work). Test performance is reported on the splits provided in Wu et al. (2018). Best performance in class is shown in bold, and best performance overall is underlined. For completeness, we present the training and validation mean ROC-AUC alongside the test ROC-AUC in Appendix A. For a graphical representation of these results, see Figure 4 in Appendix C.

Model means and standard deviations are presented in Table 2. To provide a picture of characteristic model behaviour, the CDFs for the hyperparameter sweeps are presented in Figure 5 of Appendix D. To draw meaningful conclusions, we compare against our own implementation of RGCN rather than the results reported in Schlichtkrull et al. (2018); Wu et al. (2018).
We will occasionally employ a one-sided hypothesis test in order to make concrete statements about model performance. The details and complete results of this test are presented in Appendix E. When we refer to significant results, this corresponds to a test statistic supporting our hypothesis with a $p$ value below our significance threshold.

In Figure 3(a) we evaluate RGAT on MUTAG and AIFB. With additive attention, WIRGAT outperforms ARGAT, consistent with Schlichtkrull et al. (2018). Interestingly, when employing multiplicative attention, the converse appears true. For node classification tasks on RDF data, this indicates that the importance of a particular relation type does not vary much (if at all) across the graph, unless one employs a multiplicative comparison[5] between node representations.

[5] Or potentially other comparisons beyond additive or constant, i.e. RGCN.
On AIFB, the best to worst performing models are: 1) additive WIRGAT, 2) multiplicative ARGAT, 3) RGCN, 4) additive ARGAT, and 5) multiplicative WIRGAT, with each comparison being significant.
When comparing against their constant attention counterparts, the significant differences observed were for additive and multiplicative ARGAT, where attention gives relative mean performance improvements of 1.03% and 0.31% respectively, and multiplicative WIRGAT, where attention gives a relative mean performance drop of 0.84%.
Although we present state-of-the-art results on AIFB with additive WIRGAT, since its performance with and without attention is not significantly different, it is unlikely that this is due to the attention mechanism itself, at least at inference time. Over the hyperparameter space, additive WIRGAT and RGCN are comparable in performance (see Figure 5(a) in Appendix D), leading us to conclude that the result is more likely attributable to finding a better hyperparameter point for additive WIRGAT during the search.
On MUTAG, the best to worst performing models are: 1) RGCN, 2) multiplicative ARGAT, 3) additive WIRGAT tied with multiplicative WIRGAT, and 4) additive ARGAT, with each comparison being significant.
When comparing against their constant attention counterparts, the significant differences observed were for additive WIRGAT and ARGAT, where attention gives relative mean performance improvements of 0.66% and 2.90% respectively, and multiplicative ARGAT, where attention gives a relative mean performance drop of 1.63%.
We note that RGCN consistently outperforms RGAT on MUTAG, contrary to what might be expected (Schlichtkrull et al., 2018). The result is surprising given that RGCN lies within the parameter space of RGAT (where the attention kernel is zero), a configuration we check through evaluating C-WIRGAT. In our experiments we have observed that both RGCN and RGAT can memorise the MUTAG training set without difficulty (this is not the case for AIFB). The performance gap between RGCN and RGAT could then be explained by the following:
During training, the RGAT layer uses its attention mechanism to solve the learning objective. Once the objective is solved, the model is not encouraged by the loss function to seek a point in parameter space that would also behave well when attention is set to a normalising constant within neighbourhoods (i.e. the parameter space point that would be found by RGCN).
The RDF tasks are transductive, meaning that a basis-dependent spectral approach is sufficient to solve them. As RGCN already memorises the MUTAG training set, a model more complex[6] than RGCN, for example RGAT, that can also memorise the training set is unlikely to generalise as well, although this is a hotly debated topic; see e.g. Zhang et al. (2016).

[6] Measured in terms of MDL, for example.
We employed a suite of regularisation techniques to get RGAT to generalise on MUTAG, including L2-norm penalties, dropout in multiple places, batch normalisation, parameter reduction and early stopping; however, none of the heavily regularised configurations of RGAT we evaluated generalise well on MUTAG.
Our final observation is that the attention mechanism presented in Section 2.1 relies on node features. The node features for the above tasks are learned from scratch (the input feature matrix is a one-hot node index) as part of the task. It is possible that, in this semi-supervised setup, there is insufficient signal in the data to learn both the input node embeddings and a meaningful attention mechanism to act upon them.
In Figure 3(b) we evaluate RGAT on Tox21. The number of samples is lower for these evaluations than for the transductive tasks, and so fewer model comparisons are accompanied by reasonable significance, although there are still some conclusions we can draw.
Through a thorough hyperparameter search, incorporating various regularisation techniques, we obtained a relative mean performance improvement of 0.72% for RGCN compared to the result reported in Wu et al. (2018), providing a much stronger baseline.
Both additive attention models match the performance of RGCN, whereas multiplicative WIRGAT and ARGAT marginally outperform RGCN, although this is not significant.

When comparing against their constant attention counterparts, the significant differences observed were for multiplicative WIRGAT and ARGAT, where attention gives relative mean performance improvements of 3.33% and 4.36% respectively. We do not observe any significant gains coming from additive attention when compared to its constant counterparts.
We have investigated a class of models we call Relational Graph Attention Networks (RGAT). These models act upon graph structures, inducing a masked self-attention that takes account of local relational structure as well as node features. This allows both nodes and their properties under specific relations to be dynamically assigned an importance for different nodes in the graph, and opens up graph attention mechanisms to a wider variety of problems.
We evaluated two specific attention mechanisms, WIRGAT and ARGAT, under both an additive and a multiplicative logit construction, and compared them to their equivalently evaluated spectral counterpart RGCN.
We find that RGAT performs competitively or poorly on established baselines, and this behaviour appears strongly task-dependent. Specifically, relational inductive tasks such as graph classification benefit from multiplicative ARGAT, whereas transductive relational tasks, such as knowledge base completion, at least in the absence of node features, are better tackled using spectral methods like RGCN or other graph feature extraction methods like WL graph kernels.
In general, we have found that WIRGAT should be paired with an additive logit mechanism, and fares marginally better than ARGAT on transductive tasks, whereas ARGAT should be paired with a multiplicative logit mechanism, and fares marginally better on inductive tasks.
We have found no cases where choosing any variation of RGAT is guaranteed to significantly outperform RGCN, although we have found that in cases where RGCN can memorise the training set, we are confident that RGAT will not perform as well. Consequently, we suggest that before attempting to train RGAT, a good first test is to inspect the training set performance of RGCN.
Through our thorough evaluation and presentation of the behaviours and limitations of these models, insights can be derived that will enable the discovery of more powerful model architectures acting upon relational structures. Observing that model variance on all of the tasks presented here is high, any future work developing and expanding these methods should choose larger, more challenging datasets. In addition, a comparison between the generalisation of spectral methods, like those presented here, and generalisations of RNNs, like Gated Graph Sequence Neural Networks, is a necessary ingredient for determining the most promising future direction for these models.
We thank Ozan Oktay for many fruitful discussions during the early stages of this work, and Jeremie Vallee for assistance with the experimental setup. We also thank April Shen, Kostis Gourgoulias and Kristian Boda, whose comments greatly improved the manuscript, as well as Claire Woodcock for support.
Atwood, J. and Towsley, D. (2016). Diffusion-Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS).
Bergstra, J., Yamins, D., and Cox, D. (2013). Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester, eds., volume 28 of Proceedings of Machine Learning Research, pp. 115–123, Atlanta, Georgia, USA. PMLR.
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. (2017). Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42.
Gavish, M., Nadler, B., and Coifman, R. R. (2010). Multiscale Wavelets on Trees, Graphs and High Dimensional Data: Theory and Applications to Semi-Supervised Learning. In ICML, pp. 367–374.
Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist., 18(1):50–60.
Monti, F., Boscaini, D., Masci, J., Rodolà, E., Svoboda, J., and Bronstein, M. M. (2017). Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 5425–5434.
Ristoski, P. and Paulheim, H. (2016). RDF2Vec: RDF Graph Embeddings for Data Mining. In The Semantic Web: ISWC 2016, Lecture Notes in Computer Science, volume 9981, pp. 498–514.

For completeness, we present the training, validation and test set performance of our models, in addition to those in Wu et al. (2018), in Table 3.
| Model | Training | Validation | Test |
| --- | --- | --- | --- |
| Multitask | | | |
| Bypass | | | |
| Weave | | | |
| RGCN | | | |
| RGCN (ours) | | | |
| Additive attention | | | |
| C-WIRGAT | | | |
| WIRGAT | | | |
| C-ARGAT | | | |
| ARGAT | | | |
| Multiplicative attention | | | |
| C-WIRGAT | | | |
| WIRGAT | | | |
| C-ARGAT | | | |
| ARGAT | | | |
We perform hyperparameter optimisation using hyperopt (Bergstra et al., 2013), with priors for the transductive tasks specified in Table 4 and priors for the inductive tasks specified in Table 5. In all experiments we use the Adam optimiser (Kingma and Ba, 2014).
| Hyperparameter | Prior |
| --- | --- |
| Graph kernel units | |
| Heads | |
| Feature dropout rate | |
| Edge dropout | |
| $W$ basis size | |
| Graph layer 1 L2 coef | |
| Graph layer 2 L2 coef | |
| Attention basis size | |
| Graph layer 1 L2 coef | |
| Graph layer 2 L2 coef | |
| Learning rate | |
| Use bias | |
| Use batch normalisation | |
| Hyperparameter | Prior |
| --- | --- |
| Graph kernel units | |
| Dense units | |
| Heads | |
| Feature dropout | |
| Edge dropout | |
| $W$ L2 coef (1) | |
| $W$ L2 coef (2) | |
| Attention L2 coef (1) | |
| Attention L2 coef (2) | |
| Learning rate | |
| Use bias | |
| Use batch normalisation | |
To aid interpretability of the results presented in Table 2, we present a chart representation in Figure 4.
To aid further insight into our results, we present the CDF for each model on each task in Figure 5. In this context, we treat the performance metric of interest during the hyperparameter search as the empirical distribution of some random variable $X$. We then define its CDF in the standard way
$$F_X(x) = P(X \le x) \qquad (16)$$
where $F_X(x)$ is the probability that $X$ takes on a value less than or equal to $x$. The CDF allows one to gauge whether any given architecture typically performs better than another across the whole hyperparameter space, rather than comparing tuned hyperparameter points, which in some cases may be outliers in terms of generic behaviour for that architecture.
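A minimal sketch of how such empirical CDFs can be constructed from raw sweep results; the metric values below are placeholders:

```python
import numpy as np

def empirical_cdf(samples):
    """Return (x, F_X(x)) pairs for an empirical CDF, per Equation 16."""
    xs = np.sort(np.asarray(samples))
    ys = np.arange(1, len(xs) + 1) / len(xs)   # P(X <= x) at each sorted x
    return xs, ys

xs, ys = empirical_cdf([0.71, 0.69, 0.74, 0.70, 0.73])
```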
Additive and multiplicative ARGAT perform poorly over most of the hyperparameter space, whereas RGCN and multiplicative WIRGAT perform comparably across the entire hyperparameter space.
Interestingly, the models that have a greater amount of hyperparameter space covering poor performance (i.e. RGCN, multiplicative and additive WIRGAT) are also the models that have a greater amount of hyperparameter space covering good performance. In other words, on MUTAG, the ARGAT prior resulted in a model whose test set performance was relatively insensitive to hyperparameter choice when compared against the other candidates. Given that the ARGAT model was the most flexible of the models evaluated, and that it was able to memorise the training set, this suggests that the task contained insufficient information for the model to learn its attention mechanism. That WIRGAT was able to at least partially learn its attention mechanism suggests that WIRGAT is less data hungry than ARGAT.
The multiplicative attention models fare poorly on the majority of the hyperparameter space compared to the other models. There is, however, a slice of the hyperparameter space where the multiplicative attention models outperform the other models, indicating that although they are difficult to train, it may be worth spending time hyperoptimising them if the best performing model on a relational inductive task is required. The additive attention models and RGCN perform comparably across the entirety of the hyperparameter space, and generally perform better than the multiplicative methods outside of the very small region of hyperparameter space mentioned above.
In order to determine whether any of our model comparisons are significant, we employ the one-sided Mann-Whitney U test (Mann and Whitney, 1947), as we are interested in the direction of movement (i.e. performance) and do not want to make any parametric assumptions about model response. For two populations $X$ and $Y$:
- The null hypothesis $H_0$ is that the two populations are equal, and
- The alternative hypothesis $H_1$ is that the probability of an observation from population $X$ exceeding an observation from population $Y$ is larger than the probability of an observation from $Y$ exceeding an observation from $X$; i.e., $P(X > Y) > P(Y > X)$.
We treat the empirical distribution of Model A as samples from population $X$ and the empirical distribution of Model B as samples from population $Y$. This gives us a window into which of a pair of models is better for a given task. Results on AIFB, MUTAG and Tox21 are given in Figure 6, Figure 7 and Figure 8 respectively.
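For reference, this test is available in SciPy; a sketch with illustrative scores:

```python
from scipy.stats import mannwhitneyu

scores_a = [0.96, 0.94, 0.95, 0.97, 0.93]   # e.g. test metrics of Model A
scores_b = [0.92, 0.93, 0.91, 0.95, 0.90]   # e.g. test metrics of Model B

# One-sided test: does Model A stochastically dominate Model B?
stat, p_value = mannwhitneyu(scores_a, scores_b, alternative='greater')
print(f'U = {stat}, p = {p_value:.4f}')      # reject H0 when p is small
```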