Natural Graph Networks

07/16/2020 ∙ by Pim de Haan, et al.

Conventional neural message passing algorithms are invariant under permutation of the messages and hence forget how the information flows through the network. Studying the local symmetries of graphs, we propose a more general algorithm that uses different kernels on different edges, making the network equivariant to local and global graph isomorphisms and hence more expressive. Using elementary category theory, we formalize many distinct equivariant neural networks as natural networks, and show that their kernels are 'just' a natural transformation between two functors. We give one practical instantiation of a natural network on graphs which uses an equivariant message network parameterization, yielding good performance on several benchmarks.




1 Introduction

Graph-structured data is among the most ubiquitous forms of structured data used in machine learning, and efficient practical neural network algorithms for processing such data have recently received much attention. Because of their scalability to large graphs, graph convolutional neural networks or message passing networks are widely used. However, it has been shown [xu2018powerful] that such networks, which pass messages along the edges of the graph and aggregate them in a permutation invariant manner, are fundamentally limited in their expressivity.

More expressive equivariant graph networks exist [kondor2018covariant, maron2018invariant], but these treat the entire graph as a monolithic linear structure (e.g. adjacency matrix) and as a result their computational cost scales superlinearly with the size of the graph. In this paper we ask the question: how can we design maximally expressive graph networks that are equivariant to global node permutations while using only local computations?

If we restrict a global node relabeling / permutation to a local neighbourhood, we obtain a graph isomorphism between local neighbourhoods (see Figure 1). If a locally connected network is to be equivariant to global node relabelings, the message passing scheme should thus process isomorphic neighbourhoods in an identical manner. Concretely, this means that weights must be shared between isomorphic neighbourhoods. Moreover, when a neighbourhood is symmetrical (Figure 1), the convolution kernel has to satisfy an equivariance constraint with respect to the symmetry group of the neighbourhood.

Local equivariance has previously been used in gauge equivariant neural networks [cohen2019gauge]. However, as the local symmetries of a graph are different on different edges, we do not have a single gauge group here. Instead, we have a symmetry groupoid. Using elementary category theory, we formalize this into natural networks, a general framework for constructing equivariant message passing networks, which subsumes prior work on equivariance on manifolds and homogeneous spaces. In this framework, an equivariant kernel is “just” a natural transformation between two functors.

When natural graph networks (NGNs) are applied to graphs that are regular lattices, such as a 2D square grid, or to a highly symmetrical grid on the icosahedron, one recovers conventional equivariant convolutional neural networks [cohen2016group, cohen2019gauge]. However, when applied to irregular grids, like knowledge graphs, which generally have few symmetries, the derived kernel constraints themselves lead to impractically little weight sharing. We address this by parameterizing the kernel with a message network, an equivariant graph network which takes as input the local graph structure. We show that our kernel constraints coincide with the constraints on the message network being equivariant to node relabelings, making this construction universal.

2 Technical Background: Global and Local Graph Networks

A key property of a graph is that its nodes do not have a canonical ordering. Nevertheless, an implementation in a computer must choose an arbitrary ordering of nodes. A graph neural network should therefore process information on graphs independently of that order. In this section we discuss two classes of equivariant graph neural networks, which we refer to as Global Equivariant Graph Networks (GEGNs) and Local Invariant Graph Networks (LIGNs) or message passing networks.

2.1 Global Equivariant Graph Networks

The general problem of building neural networks that operate on graph features has been discussed by, among others, kondor2018covariant, maron2018invariant, maron2019provably. These methods encode all data of a graph of $N$ nodes, including adjacency structure, node features and other information into tensors. For example, node features can be encoded in an $N$-dimensional order 1 tensor (a vector) and the adjacency matrix as an $N \times N$ dimensional order 2 tensor (a matrix). When building practical networks, one generally uses multiple copies of such tensors (i.e. a direct sum), corresponding to the channels in a CNN. For notational simplicity, we omit this multiplicity in this exposition, unless otherwise mentioned.

A relabelling of the node order of such a graph is simply a permutation over $N$ symbols, or an element $\sigma$ of the symmetric group $S_N$, representable as a bijection $\sigma: [N] \to [N]$. An order $m$ tensor feature $v$ transforms under $\sigma$ into $\sigma(v)$ by permuting all the indices. For example, an order 1 vector feature transforms as $\sigma(v)_i = v_{\sigma^{-1}(i)}$ and an order 2 adjacency matrix transforms as $\sigma(A)_{ij} = A_{\sigma^{-1}(i)\sigma^{-1}(j)}$. An equivariant graph network $f$, mapping an order $m$ tensor to an order $n$ tensor, should be equivariant under this action:

$$f(\sigma(v)) = \sigma(f(v)) \quad \text{for all } \sigma \in S_N. \tag{1}$$

The composition of such equivariant transformations is equivariant. Optionally, it can be followed by an invariant function $h$, such that $h(\sigma(v)) = h(v)$, to create an invariant graph network.
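The permutation action and the equivariance condition of eq. 1 can be checked numerically on a small case. The sketch below is illustrative only: the helper names (`act_vector`, `act_matrix`, `layer`) are hypothetical, nodes are indexed from 0, and the toy layer (summing neighbour features) takes both the adjacency matrix and the node features as its input tensors.

```python
from itertools import permutations

def act_vector(sigma, v):
    """sigma(v)_i = v_{sigma^{-1}(i)}: the entry at node j moves to node sigma[j]."""
    out = [None] * len(v)
    for j, value in enumerate(v):
        out[sigma[j]] = value
    return out

def act_matrix(sigma, A):
    """sigma(A)_{ij} = A_{sigma^{-1}(i) sigma^{-1}(j)}."""
    n = len(A)
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[sigma[i]][sigma[j]] = A[i][j]
    return out

def layer(A, v):
    """Toy order-1 to order-1 map: sum the features of each node's neighbours."""
    n = len(v)
    return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

A = [[0, 1, 1, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]]
v = [1, 2, 3, 4]

# Relabelling the inputs and then applying the layer must equal
# applying the layer and then relabelling the output, for every sigma.
equivariant = all(
    layer(act_matrix(s, A), act_vector(s, v)) == act_vector(s, layer(A, v))
    for s in permutations(range(4))
)
print(equivariant)
```

The check enumerates all 24 permutations of four nodes, which is only feasible for tiny graphs; it is meant as a sanity test of the definition, not as a practical procedure.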

General solutions to graph equivariant linear transformations have been introduced in kondor2018covariant, maron2018invariant and a construction based on combining MLPs by maron2019provably. What these solutions have in common is that they globally transform the tensor feature at once, which can be powerful (even universal), but has a computational cost that is polynomial in the number of nodes and hence difficult to scale up to large graphs.

2.2 Local Invariant Graph Networks

An entirely different strategy to building neural networks on graphs is using graph convolutional neural networks or message passing networks [kipf2016semi, gilmer2017neural]. We will refer to this class of methods as local invariant graph networks (LIGNs). In their simplest form, these transform graph signals $v$ with a feature $v_p$ on each node $p$ (corresponding to a global order 1 tensor) by passing messages over the edges of the graph using a single shared linear transformation $W$, as follows:

$$v'_q = \sum_{(p,q) \in E} W v_p, \tag{2}$$

where $E$ is the set of edges of the graph. Such convolutional architectures are generally more computationally efficient than the global methods, as the cost of computing one linear transformation scales linearly with the number of edges.
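A minimal sketch of this sum-aggregation message passing step, with one shared linear map applied once per edge, could look as follows. The function names and the toy graph are hypothetical; features are plain Python lists keyed by node.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) with vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def message_passing(edges, features, W):
    """v'_q = sum over edges (p, q) of W v_p, with a single shared W."""
    dim_out = len(W)
    out = {q: [0] * dim_out for q in features}
    for p, q in edges:                      # one matrix-vector product per edge
        msg = matvec(W, features[p])
        out[q] = [a + b for a, b in zip(out[q], msg)]
    return out

# Undirected path 0 - 1 - 2, stored as directed pairs in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
features = {0: [1, 0], 1: [0, 1], 2: [2, 2]}
W = [[1, 2], [0, 1]]

out = message_passing(edges, features, W)
print(out)
```

The loop body runs once per edge, which is where the linear scaling in the number of edges comes from.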

Figure 2: Two regular graphs.

This model can be generalized by using aggregation functions other than the sum and by having the messages also depend on $v_q$ instead of just $v_p$ [gilmer2017neural]. These constructions satisfy the equivariance condition eq. 1, but the output at each node is invariant under a permutation of its neighbours, which is the reason for the limited expressivity noted by [xu2018powerful]. For example, no invariant message passing network can discriminate between the two regular graphs in figure 2. Furthermore, if applied to the rectangular pixel grid graph of an image, it corresponds to applying a convolution with isotropic filters.
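The limitation can be illustrated on a pair of regular graphs. The graphs in figure 2 are not recoverable from the text, so the sketch below uses an assumed stand-in pair: a hexagon and two disjoint triangles, both 2-regular on six nodes. With identical initial features, sum-aggregation message passing produces the same multiset of node features on both graphs after any number of rounds.

```python
def propagate(edges, features, rounds, w=1):
    """Repeated invariant message passing with a shared scalar weight w."""
    for _ in range(rounds):
        new = {q: 0 for q in features}
        for p, q in edges:
            new[q] += w * features[p]
        features = new
    return features

def undirected(pairs):
    """Store each undirected edge as two directed pairs."""
    return [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs]

hexagon = undirected([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
two_triangles = undirected([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])
init = {p: 1 for p in range(6)}

# A permutation invariant readout: the sorted multiset of node features.
readout_hex = sorted(propagate(hexagon, init, 3).values())
readout_tri = sorted(propagate(two_triangles, init, 3).values())
print(readout_hex == readout_tri)
```

Every node has exactly two neighbours in both graphs, so each round simply doubles every feature and the two readouts coincide even though the graphs are not isomorphic.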

3 Natural Graph Networks

To overcome the limitations of existing message passing networks while remaining in the more computationally efficient regime, we propose a new kind of message passing network in which the weights depend on the structure of the graph. That is, we modify eq. 2 as follows, for graph $G$:

$$v'_q = \sum_{(p,q) \in E(G)} K^G_{pq} v_p, \tag{3}$$

where the linear kernel $K^G_{pq}$ can now differ per graph and per edge. Clearly, not all such kernels lead to equivariant networks. Defining the space of kernels that do is the goal of the remainder of this section.
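To see why only some per-edge kernels give an equivariant network, consider the following sketch with scalar features and scalar kernels. Both kernels are hypothetical examples: `label_kernel` depends on the arbitrary node labels, while `degree_kernel` depends only on the graph structure around the edge.

```python
def convolve(edges, features, kernel):
    """v'_q = sum over edges (p, q) of K_{pq} v_p, with scalar features and kernels."""
    out = {q: 0 for q in features}
    for p, q in edges:
        out[q] += kernel(edges, p, q) * features[p]
    return out

def label_kernel(edges, p, q):
    return p + 2 * q                              # uses the arbitrary node labels

def degree_kernel(edges, p, q):
    return sum(1 for a, _ in edges if a == p)     # uses only the graph structure

def relabel(phi, edges, features):
    """Apply the node relabelling phi to the edge list and the feature field."""
    return ([(phi[p], phi[q]) for p, q in edges],
            {phi[p]: x for p, x in features.items()})

def is_equivariant(kernel, phi, edges, features):
    edges2, features2 = relabel(phi, edges, features)
    lhs = convolve(edges2, features2, kernel)                         # relabel, then convolve
    _, rhs = relabel(phi, edges, convolve(edges, features, kernel))   # convolve, then relabel
    return lhs == rhs

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]          # undirected path 0 - 1 - 2
features = {0: 1, 1: 2, 2: 5}
phi = {0: 1, 1: 2, 2: 0}

print(is_equivariant(degree_kernel, phi, edges, features))
print(is_equivariant(label_kernel, phi, edges, features))
```

In this toy case the structure-based kernel commutes with the relabelling and the label-based kernel does not, which is the distinction the kernel constraints in the remainder of the section make precise.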

3.1 Global and Local Graph Symmetries

We represent the nodes $V(G)$ in a graph $G$ of $N$ nodes by the integers $[N] = \{1, \dots, N\}$. The graph structure can then be encoded by a set of pairs of integers $E(G) \subseteq V(G) \times V(G)$, with $(p,q) \in E(G)$ iff the graph contains a directed or undirected arrow $p \to q$. Graphs $G$ and $G'$ are similar or isomorphic if an isomorphism $\phi: G \to G'$ exists, which is a bijection between node sets $V(G)$ and $V(G')$ such that $(p,q) \in E(G) \iff (\phi(p), \phi(q)) \in E(G')$. In other words, a graph isomorphism maps nodes to nodes and edges to edges. The node permutation in the prior discussion is actually an isomorphism of graphs, as the adjacency structure of the graph also changes implicitly under that permutation. A special kind of isomorphism is an isomorphism of a graph to itself. This is called an automorphism and is a permutation of nodes, such that the edge set remains invariant. By definition, the automorphisms form a group, called the automorphism group.
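For very small graphs, isomorphisms and automorphism groups can be enumerated by brute force, which makes the definitions concrete. The sketch below is only an illustration: the helper names are hypothetical, nodes are indexed from 0, and the enumeration over all permutations does not scale beyond a handful of nodes.

```python
from itertools import permutations

def is_isomorphism(phi, edges_g, edges_h):
    """phi[p] is the image of node p; phi must be a bijection between the node sets.

    For a bijection, (p, q) in E(G) <=> (phi(p), phi(q)) in E(G') holds exactly
    when the image of E(G) equals E(G').
    """
    return {(phi[p], phi[q]) for p, q in edges_g} == set(edges_h)

def automorphisms(n, edges):
    """All isomorphisms of the graph to itself, by brute force over S_n."""
    return [phi for phi in permutations(range(n)) if is_isomorphism(phi, edges, edges)]

def undirected(pairs):
    """Store each undirected edge as two directed pairs."""
    return [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs]

square = undirected([(0, 1), (1, 2), (2, 3), (3, 0)])
path = undirected([(0, 1), (1, 2), (2, 3)])

aut_square = automorphisms(4, square)
aut_path = automorphisms(4, path)
print(len(aut_square), len(aut_path))
```

The square has the eight symmetries of the dihedral group, while the four-node path only has the identity and the reversal. Composing any two automorphisms again gives an automorphism, which is the group property mentioned above.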

The equivariance constraint under permutations (eq. 1) becomes an equivariance constraint under graph isomorphisms, shown in figure 3. However, using graph isomorphisms to find constraints on the kernel used in our convolution in eq. 3 is not desirable, because they lead to global constraints, meaning that the space of allowed kernels for edge $(p,q)$ is affected by the structure of the graph far away from edge $(p,q)$. Hence, it is natural to define a relaxed notion of symmetry that only depends on the local structure of the graph. Intuitively, we can think of the local structure of the graph around edge $(p,q)$ as the context in which we transport information from node $p$ to node $q$. When another edge has a similar context, we want to pass information similarly and get a weight sharing constraint.

Figure 3: Global equivariance under global isomorphism between two graphs. Colours denote feature values.

The local symmetries arise by choosing for each edge $(p,q)$ in graph $G$ a neighbourhood $G_{pq}$, which is a subgraph of the ambient graph $G$ containing edge $(p,q)$. Examples of neighbourhoods are the coloured subgraphs in the graph in figure 1. We can define isomorphisms between neighbourhoods $G_{pq}$ and $G'_{p'q'}$ as graph isomorphisms $\psi: G_{pq} \to G'_{p'q'}$ that map edge $(p,q)$ of $G$ to edge $(p',q')$ of $G'$. Such graphs are sometimes referred to as edge-rooted graphs. In figure 1, the blue neighbourhood contains a local automorphism, mapping the neighbourhood to itself. Later, we will find that neighbourhood isomorphisms lead to weight sharing and neighbourhood automorphisms to constraints on the kernel.

We require that the local symmetries form a superset of the global symmetries, so that equivariance to local symmetries implies equivariance to global symmetries. Hence, for any global isomorphism $\phi: G \to G'$, and any edge neighbourhood $G_{pq}$, a corresponding local isomorphism $\psi: G_{pq} \to G'_{p'q'}$ should exist, such that the restriction of $\phi$ to the neighbourhood, $\phi|_{G_{pq}}$, equals $\psi$. Many neighbourhood constructions that satisfy this consistency property are possible. For example, one can choose as the neighbourhood of edge $(p,q)$ all nodes that are $k$ edges removed from $p$ or $q$, for some non-negative integer $k$, and all edges between these nodes. The example neighbourhoods in figure 1 are of this family with $k=1$.
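The example neighbourhood family can be sketched as a breadth-first expansion around the two endpoints of the edge. The function name is hypothetical, and the sketch assumes that "k edges removed" means at most k edges away, with distances ignoring edge direction.

```python
def edge_neighbourhood(edges, p, q, k):
    """Nodes at most k edges away from p or q, plus all edges between them."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)   # distance ignores direction
    nodes = {p, q}
    frontier = {p, q}
    for _ in range(k):
        # Expand by one hop and keep only nodes that are new.
        frontier = {n for m in frontier for n in adjacency.get(m, ())} - nodes
        nodes |= frontier
    sub_edges = [(a, b) for a, b in edges if a in nodes and b in nodes]
    return nodes, sub_edges

# Undirected path 0 - 1 - 2 - 3 - 4, stored as directed pairs in both directions.
pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]
path5 = pairs + [(b, a) for a, b in pairs]

nodes, sub_edges = edge_neighbourhood(path5, 1, 2, 1)
print(sorted(nodes), sorted(sub_edges))
```

Because the construction only uses graph distances, relabelling the whole graph relabels the neighbourhood in the same way, which is the consistency property required above.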

Similarly, we can define local symmetries for other graph substructures, such as nodes. For node $p$, we pick a neighbourhood $G_p$ and define a local symmetry to be an isomorphism between node neighbourhoods $G_p$ and $G'_{p'}$ that maps $p$ to $p'$. Again, we naturally desire that the neighbourhoods are constructed so that global graph isomorphisms restrict to local neighbourhood isomorphisms.

Figure 4: Node neighbourhood gauge transformation.

3.2 Features

In order to have equivariant neural networks with expressive kernels, it is necessary that the feature vector $v_p$ at node $p$ itself transforms as node $p$ is mapped to node $p'$ by some global symmetry, rather than remain invariant, as is done in invariant message passing networks. We can define a local notion of such a transformation rule by letting the feature vector transform under local node symmetries.

An example of such a node neighbourhood is given by the five numbered nodes around node $p$ in figure 4. The nodes in the node neighbourhood $G_p$ are given an arbitrary ordering, or gauge [cohen2019gauge]. Such a gauge can be seen as a bijection from the integers to the nodes in the neighbourhood: $w_p: [N_p] \to V(G_p)$, where $N_p$ is the number of nodes in $G_p$. Two gauges $w_p$ and $w'_p$ are always related by some permutation $g \in S_{N_p}$, where $S_{N_p}$ is the permutation group, such that $w'_p = w_p \circ g$. The basis of the feature space depends on the arbitrary choice of gauge. Vector $v$ expressed in gauge $w_p$ has coefficients $\rho(g^{-1}) v$ in gauge $w_p \circ g$, where $\rho$ is a group representation of $S_{N_p}$. This example is shown in figure 4.

Now, if we have a local node isomorphism $\psi: G_p \to G'_{p'}$, we want to transport vector $v_p \in V_p$ to $V_{p'}$, for which we have to take the gauges $w_p$ and $w_{p'}$ into account. This requires us to pick representations such that nodes with isomorphic neighbourhoods have the same representation, noting that $N_p = N_{p'}$. We can transport $v_p$ by $\rho(g) v_p$, where $g = w_{p'}^{-1} \circ \psi \circ w_p$ is the required gauge transformation. In figure 5, we show how for LIGN, GEGN and our NGN the features of the nodes in a graph transform under a global graph isomorphism.
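The transport of a feature along a local node isomorphism can be sketched for one concrete choice of representation. The code assumes the permutation representation, with one coefficient per node of the neighbourhood; the text allows any representation of the permutation group. Gauges are tuples listing the neighbourhood nodes in order, and all names and node labels are hypothetical.

```python
def gauge_transformation(w_p, w_p2, psi):
    """g = w_{p'}^{-1} o psi o w_p, as a tuple with g[i] the image of position i."""
    position = {node: i for i, node in enumerate(w_p2)}
    return tuple(position[psi[node]] for node in w_p)

def rho(g, v):
    """Permutation representation (an assumed choice): (rho(g) v)_{g(i)} = v_i."""
    out = [None] * len(v)
    for i, value in enumerate(v):
        out[g[i]] = value
    return out

w_p = ("a", "b", "c")                  # gauge at p: position i -> node of G_p
w_p2 = ("z", "x", "y")                 # gauge at p': position i -> node of G'_{p'}
psi = {"a": "x", "b": "y", "c": "z"}   # local node isomorphism G_p -> G'_{p'}

g = gauge_transformation(w_p, w_p2, psi)
v_p = [10, 20, 30]                     # coefficient i belongs to node w_p[i]
v_p2 = rho(g, v_p)                     # transported feature, in gauge w_p2
print(g, v_p2)
```

After transport, the coefficient that belonged to a node of the first neighbourhood sits at the position that the second gauge assigns to the image of that node under the isomorphism.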

Figure 5: The feature transport by a global isomorphism between two graphs. For invariant local graph networks (LIGN), the node features permute, but each feature itself remains invariant. For global equivariant graph networks (GEGN), each node feature additionally transforms by the global isomorphism. In contrast, in our natural graph networks (NGN), the node features transform under the isomorphisms of the local neighbourhood of the node. Each node neighbourhood in an NGN requires an arbitrary ordering of nodes, or gauge, for which we here pick the order of the global node indices.

3.3 Local Equivariance

A kernel $K^G_{pq}$ on edge $(p,q)$ is a map from vector space $V_p$ at $p$ to $V_q$ at $q$. Local equivariance of this kernel means that if we have a local isomorphism of edges $\psi: G_{pq} \to G'_{p'q'}$, the outcome is the same if we first transport the signal from $p$ to $p'$ and then apply kernel $K^{G'}_{p'q'}$ or if we first apply kernel $K^G_{pq}$ and then transport from $q$ to $q'$, as depicted in figure 6. Thus we require:

$$K^{G'}_{p'q'} \circ \rho_p(g_p) = \rho_q(g_q) \circ K^{G}_{pq}, \tag{4}$$

where $g_p$ and $g_q$ are the gauge transformations induced by $\psi$ at $p$ and $q$ (section 3.2). For an isomorphism between different edges, this implies that the kernel is shared. For each automorphism from an edge to itself, it results in a linear constraint the linear map should satisfy: $\rho_q(g_q) \circ K^G_{pq} = K^G_{pq} \circ \rho_p(g_p)$, leading to a linear subspace of permissible kernels. The key theoretical result of NGNs is that local equivariance implies global equivariance, which is proven in appendix B:
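For permutation representations, the linear subspace of kernels permitted by the automorphism constraint can be written down directly. This is a sketch under that assumption, with hypothetical names, and it is not the procedure used in the experiments (appendix A reports solving the constraint with SVD). With permutation matrices the constraint says that the kernel entry at index pair (i, j) equals the entry at the pair obtained by applying the automorphism to both indices, so one indicator matrix per orbit of index pairs forms a basis.

```python
def kernel_basis(generators, n_p, n_q):
    """Basis of {K : rho_q(g_q) K = K rho_p(g_p)} for permutation representations.

    Each generator is a pair (g_p, g_q) of permutations given as tuples.
    The constraint reads K[g_q(i)][g_p(j)] = K[i][j], so K must be constant on
    the orbits of index pairs (i, j); one indicator matrix per orbit is a basis.
    """
    parent = {(i, j): (i, j) for i in range(n_q) for j in range(n_p)}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g_p, g_q in generators:
        for i in range(n_q):
            for j in range(n_p):
                parent[find((i, j))] = find((g_q[i], g_p[j]))

    orbits = {}
    for pair in parent:
        orbits.setdefault(find(pair), []).append(pair)

    basis = []
    for members in orbits.values():
        K = [[0] * n_p for _ in range(n_q)]
        for i, j in members:
            K[i][j] = 1
        basis.append(K)
    return basis

# Toy automorphism: a mirror symmetry swapping the second and third node
# of a three-node neighbourhood, acting identically at p and at q.
mirror = (0, 2, 1)
basis = kernel_basis([(mirror, mirror)], 3, 3)
print(len(basis))
```

In this toy case the mirror symmetry reduces the nine free entries of an unconstrained 3 by 3 kernel to five free parameters, one per orbit. Merging pairs along the generators is enough, because the orbits of a finite group are the connected components generated by its generators.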

Theorem 1.

For a collection of graphs with node and edge neighbourhoods, node feature representations and a kernel on each edge, such that (1) all neighbourhoods are consistent with global isomorphisms, (2) nodes with isomorphic neighbourhoods have the same representation, and (3) the kernels are locally equivariant, then the convolution operation (eq. 3) is equivariant to global isomorphisms.

In appendix C, we show that when an NGN is applied to a regular lattice, which is a graph with a global transitive symmetry, the NGN is equivalent to a group equivariant convolutional neural network [cohen2016group], when the representations and neighbourhoods are chosen appropriately. In particular, when the graph is a square grid with edges on the diagonals, we recover an equivariant planar CNN with 3x3 kernels. Bigger kernels are achieved by adding more edges. When the graph is a grid on a locally flat manifold, such as an icosahedron or another platonic solid, and the grid is a regular lattice, except at some corner points, the NGN is equivalent to a gauge equivariant CNN [cohen2019gauge], except around the corners.

Figure 6: Two commuting diagrams the kernel should satisfy arising from local iso- & automorphisms.

4 Graph Neural Network Message Parameterization

Equivariance requires weight sharing only between edges with isomorphic neighbourhoods, so, in theory, one can use separate parameters for each isomorphism class of edge neighbourhoods to parameterize the space of equivariant kernels. In practice, graphs such as social graphs are quite heterogeneous, so that few edges are isomorphic and few weights are shared, making learning and generalization difficult. This can be addressed by re-interpreting the message from $p$ to $q$, $K^G_{pq} v_p$, as a function $K(G_{pq}, v_p)$ of the edge neighbourhood $G_{pq}$ and feature value $v_p$ at $p$, potentially generalized to being non-linear in $v_p$, and then letting $K$ be a neural network-based “message network”.

Equivariance can be guaranteed, even without explicitly solving kernel constraints for each edge, in the following way. By construction of the neighbourhoods, the node feature $v_p$ can always be embedded into a graph feature of the edge neighbourhood $G_{pq}$. The resulting graph feature can then be processed by an appropriate equivariant graph neural network operating on $G_{pq}$, in which nodes $p$ and $q$ have been distinctly marked, e.g. by an additional feature. The output graph feature can be restricted to create a node feature at $q$, which is the message output. The messages are then aggregated, e.g. by summing, to create the convolution output $v'_q$. This is illustrated in figure 7. It is proven in appendix B that the graph equivariance constraint on the message network ensures that the resulting message satisfies the local equivariance constraint eq. 4 and furthermore that if the graph network is a universal approximator of equivariant functions on the graph, any local equivariant kernel can be expressed in this way.

The selection of the type of graph feature and message network forms a large design space of natural graph networks. If, as in the example above, the node feature $v_p$ is a vector representation of the permutation of the node neighbourhood, the feature can be embedded into an invariant scalar feature of the edge neighbourhood graph by assigning an arbitrary node ordering to the edge neighbourhood and transporting from the node neighbourhood to the edge neighbourhood, setting a 0 for nodes outside the node neighbourhood. Any graph neural network with invariant features can subsequently be used to process the edge neighbourhood graph feature, whose output we restrict to obtain the message output at $q$. As a simplest example, we propose GCN, which uses an invariant message passing algorithm, or Graph Convolutional Neural Network [kipf2016semi], on graph $G_{pq}$ as message network.
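The embed, process, restrict pipeline can be sketched with a single invariant message passing layer standing in for the message network. This is a simplified illustration, not the architecture used in the experiments: the function name, the four scalar weights and the unnormalised sum aggregation are all assumptions, and features are dictionaries keyed by node so that no explicit gauge is needed.

```python
def relu(x):
    return max(0, x)

def message(nbhd_nodes, nbhd_edges, p, q, v_p, q_nodes, weights):
    """Message from p to q computed by one invariant layer on the edge neighbourhood.

    v_p: dict node -> scalar, defined on the node neighbourhood of p.
    q_nodes: nodes of the node neighbourhood of q.
    weights: (w_self, w_neigh, w_mark_p, w_mark_q), hypothetical scalar parameters.
    """
    w_self, w_neigh, w_mark_p, w_mark_q = weights
    # Embed: zero outside the node neighbourhood of p, plus markers for p and q.
    h = {n: v_p.get(n, 0) + w_mark_p * (n == p) + w_mark_q * (n == q)
         for n in nbhd_nodes}
    # Process: one sum-aggregation message passing step on the edge neighbourhood.
    out = {n: w_self * h[n] for n in nbhd_nodes}
    for a, b in nbhd_edges:
        out[b] += w_neigh * h[a]
    # Restrict: keep the output on the node neighbourhood of q.
    return {n: relu(out[n]) for n in q_nodes}

# Toy edge neighbourhood for edge (0, 2): a triangle 0-1-2 with a tail 2-3.
pairs = [(0, 1), (1, 2), (2, 0), (2, 3)]
nbhd_edges = pairs + [(b, a) for a, b in pairs]
nbhd_nodes = [0, 1, 2, 3]
v_p = {0: 1, 1: 2, 2: 3}          # feature on the node neighbourhood of p = 0
q_nodes = [0, 1, 2, 3]            # node neighbourhood of q = 2
weights = (2, 1, 10, 100)

msg = message(nbhd_nodes, nbhd_edges, 0, 2, v_p, q_nodes, weights)
print(msg)
```

Since every step only uses the graph structure and the two markers, relabelling the nodes of the neighbourhood relabels the resulting message in the same way, which is the property that makes the construction locally equivariant.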

Figure 7: Message passing as graph convolution. The node feature $v_p$ at $p$ can be embedded into a graph feature of the edge neighbourhood, to which any equivariant graph neural network can be applied. The output graph feature can be restricted to obtain the message from $p$ to $q$, $K^G_{pq} v_p$. The messages to $q$ are invariantly aggregated to form output feature $v'_q$.

5 Categorical Perspective

Equivariance constraints to global symmetries, such as eq. 1, are used widely in machine learning and have recently been extended to local symmetry groups, or gauge symmetry [cohen2019gauge]. However, these formalisms do not include the locally varying local symmetries of graphs and a more general language is needed. For this, we make use of category theory, originally developed in algebraic topology, but recently also used as a modelling tool for more applied problems [fong2018seven]. Its constructions give rise to an elegant framework for building equivariant message passing networks, which we call “Natural Networks”. In this section, we will sketch the key ingredients of natural networks. Details are provided in Appendix D, including how it subsumes prior work on equivariance on manifolds and homogeneous spaces. We refer a reader interested in learning more about category theory to Leinster2016basic and fong2018seven.

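A (small) category $\mathcal{C}$ consists of a set of objects $\mathrm{Ob}(\mathcal{C})$ and for each two objects, $X, Y \in \mathrm{Ob}(\mathcal{C})$, a set of abstract (homo)morphisms, or arrows, $f: X \to Y$ between them. The arrows can be composed associatively into new arrows and each object has an identity arrow with the obvious composition behaviour. A map between two categories $\mathcal{C}$ and $\mathcal{D}$ is a functor $F: \mathcal{C} \to \mathcal{D}$, when it maps each object $X \in \mathrm{Ob}(\mathcal{C})$ to an object $F(X) \in \mathrm{Ob}(\mathcal{D})$ and each morphism $f: X \to Y$ in $\mathcal{C}$ to a morphism $F(f): F(X) \to F(Y)$ in $\mathcal{D}$, such that $F(g \circ f) = F(g) \circ F(f)$. Given two functors $F, H: \mathcal{C} \to \mathcal{D}$, a natural transformation $\eta: F \Rightarrow H$ consists of, for each object $X \in \mathrm{Ob}(\mathcal{C})$, a morphism $\eta_X: F(X) \to H(X)$, such that for each morphism $f: X \to Y$ in $\mathcal{C}$, the naturality square commutes, meaning that the two compositions are the same:

$$\eta_Y \circ F(f) = H(f) \circ \eta_X. \tag{5}$$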
We can model NGNs by first defining a category $\mathcal{C}$ of edge neighbourhoods $G_{pq}$, whose morphisms are local graph isomorphisms. The functor $F_0$ maps edge neighbourhood $G_{pq}$ to the feature space at start node $p$, $V_p$, an object of the category $\mathrm{Vect}$ of vector spaces and linear maps. $F_0$ maps edge morphism $\psi: G_{pq} \to G'_{p'q'}$ to the linear map $V_p \to V_{p'}$ that transports features from $p$ in gauge $w_p$ to $p'$ in gauge $w_{p'}$, as described in section 3.2. Functor $F_1$ is similar, but then for the end node $q$. Our kernel $K$ is “just” a natural transformation $K: F_0 \Rightarrow F_1$, as for each edge neighbourhood $G_{pq}$, it defines a linear map $K^G_{pq}: V_p \to V_q$ and the naturality condition eq. 5 specialises to exactly kernel constraint eq. 4. We see that by just defining a groupoid of local symmetries and two functors, a very general concept specialises to model our natural graph network kernels, an indication of the expressive power of applied category theory.
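The specialisation of the naturality condition to the kernel constraint can be spelled out step by step. The notation below is an assumed reconstruction: $F$ and $H$ denote generic functors with natural transformation $\eta$, $F_0$ and $F_1$ the start-node and end-node feature functors, $\psi: G_{pq} \to G'_{p'q'}$ a local edge isomorphism, and $g_p$, $g_q$ the gauge transformations it induces at $p$ and $q$.

```latex
\begin{aligned}
&\text{Naturality for } f: X \to Y:
  && \eta_Y \circ F(f) = H(f) \circ \eta_X \\
&\text{Substitute } X = G_{pq},\ Y = G'_{p'q'},\ f = \psi,\ F = F_0,\ H = F_1,\ \eta = K:
  && K^{G'}_{p'q'} \circ F_0(\psi) = F_1(\psi) \circ K^{G}_{pq} \\
&\text{Insert the transports } F_0(\psi) = \rho_p(g_p),\ F_1(\psi) = \rho_q(g_q):
  && K^{G'}_{p'q'} \circ \rho_p(g_p) = \rho_q(g_q) \circ K^{G}_{pq}
\end{aligned}
```

The last line is the local equivariance constraint on the kernel: when $\psi$ connects two different edges it enforces weight sharing, and when $\psi$ is an automorphism it restricts the kernel to a linear subspace.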

6 Related Work

As discussed in Section 2, graph neural networks can be broadly classified into local (message passing) and global equivariant networks. The former in particular has received a lot of attention, with early work by [Gori_Monfardini_Scarselli_2005, kipf2016semi]. Many variants have been proposed, with some influential ones including [gilmer2017neural, Velickovic_2018, Li_Tarlow_Brockschmidt_Zemel_2017]. Global methods include [hartford2018deep, maron2018invariant, maron2019provably, albooyeh2019incidence]. We note that in addition to these methods, there are graph convolutional methods based on spectral rather than spatial techniques [Bruna_Zaremba_Szlam_LeCun_2014, Defferrard_Bresson_Vandergheynst_2016, Perraudin_Defferrard_Kacprzak_Sgier_2018].

Covariant Compositional Networks (CCN) [kondor2018covariant] are most closely related to NGNs, as this is also a local equivariant message passing network. CCN also uses node neighbourhoods, node neighbourhood gauges and node features that are natural representations of the node neighbourhood gauge. CCNs are a special case of NGNs. When in an NGN (1) the node neighbourhood is chosen to be the receptive field of the node, so that the node neighbourhood grows in each layer, (2) the edge neighbourhood $G_{pq}$ is chosen to be the node neighbourhood of $q$, and (3) the kernel is additionally restricted by the permutation group, rather than just its subgroup the automorphism group of the edge neighbourhood, a CCN is recovered. These specific choices cause the feature dimensions to grow as the network gets deeper, which can be problematic for large graphs. Furthermore, the kernel is more restricted, as only a subspace of the equivariant kernels is used by CCNs.

Graph neural networks have found a wide range of applications, including quantum chemistry [gilmer2017neural], matrix completion [Berg_Kipf_Welling_2017], and modeling of relational data [Schlichtkrull_Kipf_Bloem_Berg_Titov_Welling_2017].

7 Experiments

| Method | Fixed | Sym |
|---|---|---|
| GCN | 96.17 | 96.17 |
| Ours | 98.82 | 98.82 |

Table 1: IcoMNIST results.
Icosahedral MNIST

In order to experimentally show that our method is equivariant to global symmetries, and increases expressiveness over an invariant message passing network (GCN), we classify MNIST digits projected onto the icosahedron, as is done in cohen2019gauge. In the first column of table 1, we show accuracy when trained and tested on one fixed projection, while in the second column we test the same model on projections that are transformed by a random icosahedral symmetry. NGN outperforms the GCN and the equality of the accuracies shows the model is exactly equivariant. Experimental details can be found in appendix A.

Graph Classification

We evaluate our model with GCN message parametrization on a standard set of 8 graph classification benchmarks from yanardag2015deep, containing five bioinformatics data sets and three social graph data sets [1]. We use the 10-fold cross validation method as described by zhang2018end and report the best averaged accuracy across the 10-folds, as described by xu2018powerful, in table 2. Results from prior work are from maron2019provably. On most data sets, our local equivariant method performs competitively with global equivariant methods [maron2018invariant, maron2019provably].

[1] These experiments were run on QUVA machines.

| data set | MUTAG | PTC | PROTEINS | NCI1 | NCI109 | COLLAB | IMDB-B | IMDB-M |
|---|---|---|---|---|---|---|---|---|
| size | 188 | 344 | 113 | 4110 | 4127 | 5000 | 1000 | 1500 |
| classes | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 3 |
| avg node # | 17.9 | 25.5 | 39.1 | 29.8 | 29.6 | 74.4 | 19.7 | 14 |
| GK [shervashidze2009efficient] | 81.39±1.7 | 55.65±0.5 | 71.39±0.3 | 62.49±0.3 | 62.35±0.3 | NA | NA | NA |
| RW [vishwanathan2010graph] | 79.17±2.1 | 55.91±0.3 | 59.57±0.1 | days | NA | NA | NA | NA |
| PK [neumann2016propagation] | 76±2.7 | 59.5±2.4 | 73.68±0.7 | 82.54±0.5 | NA | NA | NA | NA |
| WL [shervashidze2011weisfeiler] | 84.11±1.9 | 57.97±2.5 | 74.68±0.5 | 84.46±0.5 | 85.12±0.3 | NA | NA | NA |
| FGSD [verma2017hunt] | 92.12 | 62.80 | 73.42 | 79.80 | 78.84 | 80.02 | 73.62 | 52.41 |
| AWE-DD [ivanov2018anonymous] | NA | NA | NA | NA | NA | 73.93 | 74.45±5.8 | 51.54 |
| AWE-FB [ivanov2018anonymous] | 87.87±9.7 | NA | NA | NA | NA | 70.99±1.4 | 73.13±3.2 | 51.58±4.6 |
| DGCNN [zhang2018end] | 85.83±1.7 | 58.59±2.5 | 75.54±0.9 | 74.44±0.5 | NA | 73.76±0.5 | 70.03±0.9 | 47.83±0.9 |
| PSCN [niepert2016learning] (k=10) | 88.95±4.4 | 62.29±5.7 | 75±2.5 | 76.34±1.7 | NA | 72.6±2.2 | 71±2.3 | 45.23±2.8 |
| DCNN [atwood2016diffusion] | NA | NA | 61.29±1.6 | 56.61±1.0 | NA | 52.11±0.7 | 49.06±1.4 | 33.49±1.4 |
| ECC [simonovsky2017dynamic] | 76.11 | NA | NA | 76.82 | 75.03 | NA | NA | NA |
| DGK [yanardag2015deep] | 87.44±2.7 | 60.08±2.6 | 75.68±0.5 | 80.31±0.5 | 80.32±0.3 | 73.09±0.3 | 66.96±0.6 | 44.55±0.5 |
| DiffPool [ying2018hierarchical] | NA | NA | 78.1 | NA | NA | 75.5 | NA | NA |
| CCN [kondor2018covariant] | 91.64±7.2 | 70.62±7.0 | NA | 76.27±4.1 | 75.54±3.4 | NA | NA | NA |
| Invariant Graph Networks [maron2018invariant] | 83.89±12.95 | 58.53±6.86 | 76.58±5.49 | 74.33±2.71 | 72.82±1.45 | 78.36±2.47 | 72.0±5.54 | 48.73±3.41 |
| GIN [xu2018powerful] | 89.4±5.6 | 64.6±7.0 | 76.2±2.8 | 82.7±1.7 | NA | 80.2±1.9 | 75.1±5.1 | 52.3±2.8 |
| 1-2-3 GNN [morris2019weisfeiler] | 86.1 | 60.9 | 75.5 | 76.2 | NA | NA | 74.2 | 49.5 |
| PMP v1 [maron2019provably] | 90.55±8.7 | 66.17±6.54 | 77.2±4.73 | 83.19±1.11 | 81.84±1.85 | 80.16±1.11 | 72.6±4.9 | 50±3.15 |
| PMP v2 [maron2019provably] | 88.88±7.4 | 64.7±7.46 | 76.39±5.03 | 81.21±2.14 | 81.77±1.26 | 81.38±1.42 | 72.2±4.26 | 44.73±7.89 |
| PMP v2 [maron2019provably] | 89.44±8.05 | 62.94±6.96 | 76.66±5.59 | 80.97±1.91 | 82.23±1.42 | 80.68±1.71 | 73±5.77 | 50.46±3.59 |
| Ours (GCN) | 89.39±1.60 | 66.84±1.79 | 71.71±1.04 | 82.37±1.35 | 83.98±1.89 | NA | 73.50±2.01 | 51.27±1.50 |
| Rank | 4th | 2nd | 15th | 5th | 1st | NA | 6th | 5th |

Table 2: Results on the Graph Classification dataset from yanardag2015deep. Above the line are non-deep learning methods, below deep learning methods.

8 Conclusion

In this paper, we have developed a new framework for building neural networks that operate on graphs, which pass messages with kernels that depend on the local graph structure and have features that are sensitive to the direction of flow of information over the graph. Using elementary category theory, we define “natural networks”, a message passing method applicable to graphs, manifolds and homogeneous spaces. Our method is provably equivariant under graph isomorphisms and at the same time computationally efficient because it acts locally through graph convolutions. We evaluate one instance of natural graph networks using a message network on several benchmarks and find competitive results.

9 Broader Impact

The broader impact of this work can be analyzed in at least two different ways.

Firstly, graph neural networks in general are particularly suited for analyzing human generated data. As a result, powerful graph neural nets can provide tremendous benefit in automating common business tasks. On the flip side, much human generated data is privacy sensitive. Therefore, as a research community, we should not solely focus on developing better ways of analyzing such data, but also invest in technologies that help protect the privacy of those generating the data.

Secondly, in this work we used some elementary applied category theory to precisely specify our problem of local equivariant message passing. We believe that applied category theory can and should be used more widely in the machine learning community. Formulating problems in a more general mathematical language makes it easier to connect disparate problem domains and solutions, as well as to communicate more precisely and thus efficiently, accelerating the research process. In the longer term, we hope that having a better language with which to talk about machine learning problems and to specify models may make machine learning systems safer.


Appendix A Experimental details

Icosahedral MNIST

We use node and edge neighbourhoods with $k=1$. We find the edge neighbourhood isomorphism classes and, for each class, the generators of the automorphism group using software package Nauty. The MNIST digit input is a trivial feature, each subsequent feature is a vector feature of the permutation group, except for the last layer, which is again trivial. We find a basis for the kernels satisfying the kernel constraint using SVD. The parameters linearly combine these basis kernels into the kernel used for the convolution. The trivial baseline uses trivial features throughout, which is equivalent to a simple Graph Convolutional Network. The baseline uses 6 times wider channels, to compensate for the smaller representations.

We did not optimize hyperparameters and have copied the architecture from cohen2019gauge. We use 6 convolutional layers with output multiplicities 8, 16, 16, 23, 23, 32, 64, with stride 1 at each second layer. After each convolution, we use batchnorm. Subsequently, we average pool over the nodes and use 3 MLP layers with output channels 64, 32 and 10. We use the Adam optimizer with learning rate 1E-3 for 200 epochs. Each training run is on one Nvidia V100 GPU with 32GB memory and lasts about 2 hours.

Different from the results in the IcoCNN paper, we are equivariant to full icosahedral symmetry, including mirrors. This harms performance in our task. A further difference is that we use an icosahedron with 647 nodes, instead of 2.5k nodes, and do not reduce the structure group, so for all non-corner nodes, we use a 7 dimensional representation of $S_7$, rather than a regular 6D representation of $C_6$.

Graph Classification

For the graph classification experiments, we again use node and edge neighbourhoods with $k=1$. This time, we use a GCN message network. At each input of the message network, we add two one-hot vectors indicating $p$ and $q$. The bioinformatics data sets have as initial feature a one-hot encoding of a node class. The others use the vertex degree as initial feature.

We use the 10-fold cross validation method as described by zhang2018end. On the second fold, we optimize the hyperparameters. Then for the best hyperparameters, we report the averaged accuracy and standard deviation across the 10-folds, as described by xu2018powerful. We train with the Adam optimizer for 1000 epochs on one Nvidia V100 GPU with 32GB memory. The slowest benchmark took 8 hours to train.
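The reporting step of this protocol can be sketched as follows, assuming the interpretation used by xu2018powerful: the validation accuracy is averaged over the folds for each epoch, the epoch with the best average is selected, and the mean and standard deviation across folds at that epoch are reported. The function name is hypothetical, the population standard deviation is an assumed choice, and the accuracy curves are made-up toy numbers.

```python
from statistics import mean, pstdev

def best_averaged_accuracy(fold_curves):
    """fold_curves[f][e]: validation accuracy of fold f after epoch e."""
    n_epochs = len(fold_curves[0])
    per_epoch_mean = [mean(curve[e] for curve in fold_curves) for e in range(n_epochs)]
    # Select the single epoch with the best accuracy averaged over the folds.
    best_epoch = max(range(n_epochs), key=per_epoch_mean.__getitem__)
    accs = [curve[best_epoch] for curve in fold_curves]
    return best_epoch, mean(accs), pstdev(accs)

# Two folds and three epochs of made-up accuracies, in percent.
toy_curves = [[50, 80, 70],
              [60, 90, 100]]
best_epoch, avg, std = best_averaged_accuracy(toy_curves)
print(best_epoch, avg, std)
```

Note that the epoch is selected jointly for all folds, rather than taking the best epoch of each fold separately, which would give a more optimistic number.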

We use 6 layers and each message network has two GCN layers. All dimensions in the hidden layers of the message network and between the message networks are either 64 or 256. The learning rate is either 1E-3 or 1E-4. The best model for MUTAG and PTC used 64 channels; for the other datasets we selected 256 channels. For IMDB-BINARY and IMDB-MULTI we selected learning rate 1E-3, for the others 1E-4.

Appendix B Definitions and Proofs

In order to prove our main theorem, we first briefly re-define all relevant concepts.

Definition 1: Node Neighbourhoods.

For a collection of graphs $\mathcal{G}$, a node neighbourhood selection selects for each graph $G \in \mathcal{G}$, for each node $p$ in $G$, a subgraph $G_p$ of $G$, such that for any graph isomorphism $\phi: G \to G'$ and any node $p$ in $G$, the image of the restriction of $\phi$ to $G_p$ equals the node neighbourhood $G'_{\phi(p)}$ of node $\phi(p)$ in $G'$.

Definition 2: Local Node Isomorphism.

For a collection of graphs with a node neighbourhood selection, the node groupoid is the category having as objects the node neighbourhoods of the graphs and as morphisms the graph isomorphisms between the node neighbourhoods $G_p$ and $G'_{p'}$ mapping $p$ to $p'$. We call the morphisms in this category local node isomorphisms. All morphisms are isomorphisms, making the category a groupoid. When such a morphism exists between two node neighbourhoods, we call them isomorphic.

Definition 3: Edge Neighbourhood, Local Edge Isomorphism.

Similar to the above, we can define an edge neighbourhood selection, the edge groupoid and local edge isomorphisms. We require that each edge neighbourhood is a supergraph of node neighbourhoods and and that the image of local edge isomorphism , restricted to node neighbourhood , equals node neighbourhood , and when restricted to , equals .

Remark 1.

By construction, any (global) graph isomorphism restricts for each edge to a local edge isomorphism and for each node to a local node isomorphism . Also, each local edge isomorphism restricts to a local node isomorphism for and for .

Definition 4: Node Gauge.

For node neighbourhood , a node gauge is a bijection , where is the number of nodes in and is the set . Such a node gauge amounts to an ordering of the nodes in the node neighbourhood. A node gauge field of a graph is a node gauge at each node.

Definition 5: Node Features.

A node feature space for node neighbourhood is a choice of a representation of , the symmetric group over symbols, with representation vector space . A node feature in node gauge is an element of . In gauge , differing from by , with , has coefficients . A node feature field on a graph is a node feature at each node.
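To make gauges and gauge transformations concrete, the following minimal sketch represents a node gauge as a dictionary from nodes to positions and transforms a feature in the vector representation when the gauge changes. The names `gauge_transition` and `transform_vector`, the 0-based positions, and the direction of the permutation are conventions of this sketch and may differ from those of the definitions above.

```python
def gauge_transition(w_old, w_new):
    """Permutation g with g[w_old[x]] = w_new[x] for every node x."""
    g = [None] * len(w_old)
    for node, i in w_old.items():
        g[i] = w_new[node]
    return g


def transform_vector(g, v):
    """Act with g in the vector representation: out[g[i]] = v[i]."""
    out = [None] * len(v)
    for i, coefficient in enumerate(v):
        out[g[i]] = coefficient
    return out


# A neighbourhood with three nodes and two different orderings of them.
w_old = {"a": 0, "b": 1, "c": 2}
w_new = {"a": 2, "b": 0, "c": 1}

# One coefficient per node, stored in the order given by w_old.
v_old = [10.0, 20.0, 30.0]
v_new = transform_vector(gauge_transition(w_old, w_new), v_old)

# The coefficient attached to each node is the same in both gauges.
consistent = all(v_old[w_old[x]] == v_new[w_new[x]] for x in w_old)
```

The feature itself is unchanged; only the order in which its coefficients are stored depends on the gauge.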

Definition 6: Local Push-forward.

Given a local node isomorphism , node gauges and , node feature spaces such that , we can define the local push-forward as .

Definition 7: Global Push-forward.

Given global isomorphism with gauge fields , if for all nodes in , we have that , we define the global push-forward of an entire node feature field of to feature field of by .

Definition 8: Message Passing Kernel.

A message passing kernel for a collection of graphs, is for each graph , for each edge a function , linear or otherwise. defines a graph convolution , mapping feature field of graph to feature field , with .
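A minimal sketch of such a graph convolution is given below. It assumes directed edges `(p, q)` with the message flowing from `p` to `q`, and features stored as plain lists. The point it illustrates is that every edge carries its own kernel, unlike conventional message passing, where one kernel is shared by all edges.

```python
def graph_convolution(edges, kernels, features):
    """Sum, at every node q, the messages K_pq(v_p) over incoming edges (p, q).

    Assumption of this sketch: edges are directed pairs (p, q) and the
    message travels from p to q. Nodes without incoming edges are omitted.
    """
    out = {}
    for p, q in edges:
        message = kernels[(p, q)](features[p])
        if q in out:
            out[q] = [a + b for a, b in zip(out[q], message)]
        else:
            out[q] = list(message)
    return out


# Path graph 0 - 1 - 2, with one (here linear) kernel per directed edge.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
kernels = {
    (0, 1): lambda v: [2.0 * x for x in v],
    (1, 0): lambda v: list(v),
    (1, 2): lambda v: list(v),
    (2, 1): lambda v: [-x for x in v],
}
features = {0: [1.0], 1: [10.0], 2: [100.0]}
new_features = graph_convolution(edges, kernels, features)
```

Node 1 receives `2.0 * 1.0` from node 0 and `-100.0` from node 2, so the order-independent sum still distinguishes where each message came from.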

Definition 9: Global Equivariance.

A message passing kernel is globally equivariant if for any global graph isomorphism , we have for each edge in , that , where , which by construction are local node isomorphisms.

Definition 10: Local Equivariance.

A message passing kernel is locally equivariant if for any local edge isomorphism , we have that , where , which by construction are local node isomorphisms.

Remark 2.

We can resolve the local equivariance constraint a bit further by noting that some local edge isomorphisms map an edge neighbourhood to itself, i.e. are automorphisms, while others map between different edges. Such a local edge automorphism leads to a constraint on the kernel of that edge. The automorphism group of the edge neighbourhood graph can be found using e.g. the software package Nauty, which has worst-case computational cost exponential in the size of the neighbourhood graph. If the kernel is linear, each automorphism leads to a linear constraint, so the kernels satisfying all constraints form a linear subspace, which can be found by singular value decomposition. The local equivariance constraint due to a local edge isomorphism between two different edges relates the kernels on those edges. For linear kernels, this means that the kernel on one edge is fully determined by the kernel on the other.
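For small neighbourhoods, the first step (finding the local edge automorphisms) can be illustrated by brute force. The sketch below is a stand-in for a dedicated tool such as Nauty, not a use of it. The function name `edge_automorphisms` is hypothetical, and it assumes that a local edge automorphism must fix both marked end nodes of the edge.

```python
from itertools import permutations


def edge_automorphisms(nodes, edges, p, q):
    """All relabelings of `nodes` that preserve the undirected edge set
    and keep the marked end nodes p and q fixed (brute force)."""
    edge_set = {frozenset(e) for e in edges}
    found = []
    for image in permutations(nodes):
        sigma = dict(zip(nodes, image))
        if sigma[p] != p or sigma[q] != q:
            continue
        mapped = {frozenset((sigma[a], sigma[b])) for a, b in edges}
        if mapped == edge_set:
            found.append(sigma)
    return found


# Edge neighbourhood made of two triangles sharing the edge (p, q).
nodes = ["p", "q", "a", "b"]
edges = [("p", "q"), ("p", "a"), ("q", "a"), ("p", "b"), ("q", "b")]
autos = edge_automorphisms(nodes, edges, "p", "q")
```

Here the only non-trivial automorphism swaps `a` and `b`, so a linear kernel on this edge would be subject to a single linear constraint.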

Lemma 1.

A locally equivariant message passing kernel is globally equivariant.


As mentioned in remark 1, by construction, any global isomorphism restricts for each edge to a local edge isomorphism, to which the kernel is equivariant. ∎

Lemma 2.

The graph convolution of a globally equivariant message passing kernel commutes with global push-forwards.


Let . Then:

where we use successively the global equivariance of , the linearity of the push-forward and the fact that gives a bijection between and . ∎
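One way to spell out this computation is sketched below. The notation is an assumption of this sketch, since the symbols of the original derivation are not recoverable here, and the steps may appear in a different order than in the proof above.

```latex
% Assumed notation (labels of this sketch, not necessarily the paper's):
%   \phi : G \to G'                          global graph isomorphism
%   \phi_p, \phi_q                           its restrictions to the node neighbourhoods of p and q
%   (\phi_* v)_{\phi(p)} = \phi_{p*} v_p     global push-forward of a feature field v
%   (K \bullet v)_q = \sum_{p : (p,q) \in E(G)} K_{pq}(v_p)        graph convolution
%   K_{\phi(p)\phi(q)} \circ \phi_{p*} = \phi_{q*} \circ K_{pq}    global equivariance
\begin{align*}
\bigl(K \bullet \phi_* v\bigr)_{\phi(q)}
  &= \sum_{p' : (p', \phi(q)) \in E(G')} K_{p'\phi(q)}\bigl((\phi_* v)_{p'}\bigr) \\
  &= \sum_{p : (p, q) \in E(G)} K_{\phi(p)\phi(q)}\bigl(\phi_{p*} v_p\bigr)
     && \text{($\phi$ is a bijection on incoming edges)} \\
  &= \sum_{p : (p, q) \in E(G)} \phi_{q*} K_{pq}(v_p)
     && \text{(global equivariance of $K$)} \\
  &= \phi_{q*} \sum_{p : (p, q) \in E(G)} K_{pq}(v_p)
     && \text{(linearity of the push-forward)} \\
  &= \bigl(\phi_* (K \bullet v)\bigr)_{\phi(q)} .
\end{align*}
```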

Theorem 2.

For a collection of graphs with node and edge neighbourhoods, node feature representations and a kernel on each edge, such that (1) all neighbourhoods are consistent with global isomorphisms, (2) nodes with isomorphic neighbourhoods have the same representation matrices, and (3) the kernels are locally equivariant, then the convolution operation (eq. 3) is equivariant to global isomorphisms.


Corollary of the above two lemmas. ∎

Remark 3.

Global equivariance requires that each global isomorphism reduces for each edge to a local edge isomorphism. The converse is not true, as not all local edge isomorphisms correspond to a global isomorphism. Hence, locally equivariant kernels are more strongly constrained than globally equivariant kernels. The advantage of using local rather than global equivariance to construct the kernels is twofold. First, it may be a desirable modeling decision to share weights on edges that are locally similar. Second, finding automorphisms, and hence kernel constraints, is computationally easier for the small edge neighbourhood than for the larger global graph.

Remark 4.

An obvious generalization, which we use widely in practice, is to have the representations of the input of an equivariant kernel be different from the representation of the output. The above lemmas trivially generalize to that case.

B.1 Message Networks

The constraints on locally equivariant message passing kernels only depend on the graph structure in the edge neighbourhood. Hence, we can write such a kernel as a function , where is the edge neighbourhood graph, in which nodes and are uniquely marked, so that the edge can be identified.

Definition 11: Edge Gauge, Edge Feature.

Similar to a node gauge, we can define an edge gauge, which orders the edge neighbourhood. An edge feature space is a representation of , where . An edge feature is an element in the edge feature space in some edge gauge .

The edge neighbourhood graph can be represented as an adjacency matrix given some edge gauge , making it an edge feature with the matrix representation of . The unique marking of and can be represented as two one-hot vectors, each an edge feature in the vector representation of . Taking the direct sum of the vectors and adjacency matrix, we get edge feature in gauge and in representation of , encoding the local graph structure. It is easy to see that this edge feature is constant under the push-forward of a local edge isomorphism: .
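The construction of this structural edge feature can be sketched as follows. The function name `edge_structure_feature` and the node labels are hypothetical. The example also checks the constancy property on one pair of isomorphic edge neighbourhoods whose gauges are related by the isomorphism.

```python
def edge_structure_feature(gauge, edges, p, q):
    """Encode an edge neighbourhood in a given edge gauge as
    (one-hot marker of p, one-hot marker of q, adjacency matrix)."""
    n = len(gauge)
    adj = [[0] * n for _ in range(n)]
    for a, b in edges:
        adj[gauge[a]][gauge[b]] = 1
        adj[gauge[b]][gauge[a]] = 1
    one_hot_p = [int(i == gauge[p]) for i in range(n)]
    one_hot_q = [int(i == gauge[q]) for i in range(n)]
    return one_hot_p, one_hot_q, adj


# Edge (p, q) whose neighbourhood is the path p - q - a.
gauge_1 = {"p": 0, "q": 1, "a": 2}
feature_1 = edge_structure_feature(gauge_1, [("p", "q"), ("q", "a")], "p", "q")

# An isomorphic edge (x, y) elsewhere, with the isomorphism p->x, q->y, a->z
# and the gauge transported along it.
gauge_2 = {"x": 0, "y": 1, "z": 2}
feature_2 = edge_structure_feature(gauge_2, [("y", "z"), ("x", "y")], "x", "y")
```

Both calls produce the same triple, which is what allows a network taking this feature as input to be shared along isomorphic edges.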

Definition 12: Node Feature Embedding / Projection.

Given node feature spaces and edge feature space and gauges , an embedding of the node feature into an edge feature in these gauges is an injective linear map . Similarly, a projection of an edge feature into a node feature is a surjective linear map . In different gauges , we get and . The embeddings and projections must be shared along isomorphic edges. That is, if we have local edge isomorphism they should commute with the push-forward: and .

An example of a node feature embedding, for vector representations of the permutation groups of the node and edge, is , linearly extended to a matrix. An example projection for those representations is , linearly extended to a matrix.

Definition 13: Message Network.

A message network for a set of graphs, given edge features, consists of, for each edge , a map . should be equivariant, so that for any and must not depend on the gauge. We require that for isomorphic edges and , . A message network, combined with gauges, node features, feature embeddings and projections, induces a message passing kernel .

If the edge features are vector, matrix, or higher order tensor representations of the permutation group, we can make a particularly practical message network, which is shared among all edges: , as the parametrization of equivariant maps between such permutation representations is independent of the number of nodes in the edge neighbourhood [maron2018invariant]. Such networks coincide exactly with the Global Equivariant Graph Networks introduced earlier. We call these shared message networks.
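The way a shared message network induces a kernel can be sketched for vector representations as below. Everything concrete here is an assumption of the sketch: `embed` zero-pads the node feature into the edge ordering, `project` reads off the coefficients of the target neighbourhood, and `message_network` is a toy permutation-equivariant map with four shared weights rather than the GCN used in the experiments.

```python
def embed(node_gauge, edge_gauge, v):
    """Zero-pad a node feature (vector representation) into the edge ordering."""
    x = [0.0] * len(edge_gauge)
    for node, i in node_gauge.items():
        x[edge_gauge[node]] = v[i]
    return x


def project(node_gauge, edge_gauge, x):
    """Read off the coefficients of the nodes in the target node neighbourhood."""
    v = [0.0] * len(node_gauge)
    for node, i in node_gauge.items():
        v[i] = x[edge_gauge[node]]
    return v


def message_network(x, adj, mark_p, mark_q, weights):
    """Toy permutation-equivariant map: self term, neighbour sum, marker terms."""
    a, b, c, d = weights
    n = len(x)
    total = sum(x)  # permutation-invariant summary
    return [
        a * x[i]
        + b * sum(adj[i][j] * x[j] for j in range(n))
        + (c * mark_p[i] + d * mark_q[i]) * total
        for i in range(n)
    ]


def induced_kernel(edge_gauge, edges, p, q, gauge_p, gauge_q, weights):
    """Kernel for edge (p, q): project after message_network after embed."""
    n = len(edge_gauge)
    adj = [[0] * n for _ in range(n)]
    for s, t in edges:
        adj[edge_gauge[s]][edge_gauge[t]] = 1
        adj[edge_gauge[t]][edge_gauge[s]] = 1
    mark_p = [int(i == edge_gauge[p]) for i in range(n)]
    mark_q = [int(i == edge_gauge[q]) for i in range(n)]

    def kernel(v_p):
        x = embed(gauge_p, edge_gauge, v_p)
        y = message_network(x, adj, mark_p, mark_q, weights)
        return project(gauge_q, edge_gauge, y)

    return kernel


# Edge neighbourhood: the path a - p - q - b.
edges = [("p", "q"), ("p", "a"), ("q", "b")]
gauge_p = {"p": 0, "q": 1, "a": 2}   # node gauge at p
gauge_q = {"q": 0, "p": 1, "b": 2}   # node gauge at q
weights = (1.0, 0.5, 2.0, -1.0)
v_p = [1.0, 2.0, 3.0]

# Two different edge gauges give the same kernel output.
k1 = induced_kernel({"p": 0, "q": 1, "a": 2, "b": 3}, edges, "p", "q", gauge_p, gauge_q, weights)
k2 = induced_kernel({"b": 0, "a": 1, "q": 2, "p": 3}, edges, "p", "q", gauge_p, gauge_q, weights)
```

Because the toy network is equivariant to permutations of the edge ordering, the induced kernel does not depend on the edge gauge, and the same four weights can be reused on edge neighbourhoods of any size.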

Lemma 3.

The kernel induced by a message network is locally equivariant.


Let be a local edge isomorphism and be the restrictions to local node isomorphisms.

where successively we used constancy of and commutation of , equivariance of , which implies commutation with the push-forward, and commutation of . ∎

Lemma 4.

Any locally equivariant message passing kernel can be expressed by the kernel induced by a shared message network, assuming the shared message network is universal.


Given any locally equivariant kernel , we can construct constraints on such that its induced kernel matches . When these constraints are mutually compatible and compatible with the equivariance constraint, and the shared message network can express any equivariant map, a exists that satisfies these constraints. For each edge isomorphism class, we pick one representative and require that , where and denote right and left inverses of and , which exist as these are linear surjections and injections respectively. Clearly, the kernel induced by a satisfying this constraint matches on edge . By the above lemma, it also matches on all isomorphic edges. The constraints on arising from two non-isomorphic edges , are independent, as there exist no such that . ∎

Appendix C Reduction to Group & Manifold Gauge Equivariance

The two dimensional plane has several regular tilings. These are graphs with a global symmetry that maps transitively between all faces, edges and nodes of the graph. For such a tiling with symmetry group , for some point group and translation group , we can show that when the neighbourhood sizes and representations are chosen appropriately, the natural graph network is equivalent to a Group Equivariant CNN of group cohen2016group. In order to do so, we must first be able to have node features that are not representations of the permutation group, but of .

Definition 14: Reduction of the Structure Group.

For node neighbourhood , the local isomorphisms transform one gauge into a gauge at . This allows us to reduce the space of gauges as follows: at one node in each node isomorphism class, we pick an arbitrary gauge , and we define the gauge at all nodes in that isomorphism class by picking a local isomorphism and setting . Any two isomorphisms are always related by an automorphism s.t. . Thus, the gauges at induced by these local isomorphisms are related by . Hence, by defining , the gauges induced by isomorphisms at the nodes isomorphic to , which we call a reduced set of gauges, are related by a subgroup of , generated by for each local node automorphism . Clearly, is isomorphic to the automorphism group of .

Given such a reduced set of gauges, we note that for any local node isomorphism , where gauge is induced by isomorphism and gauge by isomorphism , the push-forward is done by gauge transformation , which is an element of , as is a local automorphism. As of the representation of , only the subgroup is used, we can just as well start with a representation of , instead of . We call such features node features with reduced structure group.

Note that this reduction of the structure group is different from the reduction mentioned in cohen2019gauge, as our reduction of to assumes no additional structure on the graph. If we do assume some additional structure, we can reduce the structure group further to a subgroup of , which corresponds to the reduction of the structure group in cohen2019gauge.

Now, as an example, consider one of the tilings of the plane, the triangular tiling. As shown in figure 8, the node neighbourhood has as automorphism group the dihedral group of the hexagon, which consists of six rotations and six reflections, so we can use features with this group as reduced structure group. The kernel is constrained by one automorphism, which mirrors along the edge. A Natural Graph Network on these reduced features is exactly equivalent to HexaConv [hoogeboom2018hexaconv]. Furthermore, the convolution is exactly equivalent to the Icosahedral gauge equivariant CNN [cohen2019gauge] on all edges that do not contain a corner of the icosahedron. A similar equivalence can be made for the square tiling and a conventional planar group equivariant CNN [cohen2016group] and a gauge equivariant CNN on the surface of a cube.

Figure 8: Node and edge neighbourhood on a triangular tiling.
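The symmetry of such a node neighbourhood can be checked by brute force. The sketch below assumes that the node neighbourhood consists of a node together with its six neighbours and all edges of the tiling between them. The figure is not reproduced here, so this is an assumption.

```python
from itertools import permutations


def automorphisms(nodes, edges):
    """All relabelings of `nodes` that preserve the undirected edge set."""
    edge_set = {frozenset(e) for e in edges}
    found = []
    for image in permutations(nodes):
        sigma = dict(zip(nodes, image))
        mapped = {frozenset((sigma[a], sigma[b])) for a, b in edges}
        if mapped == edge_set:
            found.append(sigma)
    return found


# Assumed neighbourhood: a centre node "c" and the hexagon 0..5 around it.
centre = "c"
ring = list(range(6))
edges = [(centre, i) for i in ring] + [(i, (i + 1) % 6) for i in ring]
autos = automorphisms([centre] + ring, edges)

n_autos = len(autos)
centre_fixed = all(sigma[centre] == centre for sigma in autos)
```

Under this assumption the enumeration finds 12 automorphisms, all of which fix the centre node: the six rotations and six reflections of the hexagon.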

Appendix D Category-theoretical formalization

Our goal is to define a general construction for equivariant message passing. It should be applicable to our graph case, as well as equivariant homogeneous and manifold convolutions. Some notation may differ from the previous sections.

The main ingredients are:

  1. We need a “data groupoid” . The objects of are locations where we will have data. The morphisms are “paths” along which we can transport data. On manifolds, this is the path groupoid.

  2. We need a “message groupoid” . The objects are the messages we pass during the convolution. The morphisms are the symmetries of the theory. For each morphism, we will later get a linear equation relating the message passing kernels.

  3. We need a pair of functors , mapping messages to the start/end data location. We write , for start/tail of the message. Symmetries of the messages are mapped to transportation of the data.

  4. We need the principal groupoid . For each , we have a pair . is a group, is a set on which has a free and transitive right action. Morphisms are pairs , such that is a group homomorphism and is equivariant, so that , . For notational simplicity, we just write and omit the group.

  5. We also need a transport functor [schreiber2007parallel] mapping paths to equivariant principal fiber maps.

  6. We need a category of associated vector spaces . This is specified by picking for each object of a vector space with a representation of . Then the objects of are . These can be seen as a space of equivariant functions . The morphisms of are linear maps. Such a morphism can be seen as equivariant maps to matrices such that , we have:

  7. If we have that for all such that , we have that , then we automatically have a functor . It maps objects and morphism to morphism .

    Otherwise, we need to specify intertwiners and have .

We can compose the functors to achieve two functors and denote these respectively . Now a kernel is “just” a natural transformation between these functors . That is, for each message , a morphism , such that, for all edge symmetries (= morphisms in )