Spectral Multigraph Networks for Discovering and Fusing Relationships in Molecules

11/23/2018 ∙ by Boris Knyazev, et al. ∙ University of Guelph ∙ SRI International

Spectral Graph Convolutional Networks (GCNs) are a generalization of convolutional networks to learning on graph-structured data. Applications of spectral GCNs have been successful, but limited to a few problems where the graph is fixed, such as shape correspondence and node classification. In this work, we address this limitation by revisiting a particular family of spectral graph networks, Chebyshev GCNs, showing its efficacy in solving graph classification tasks with a variable graph structure and size. Chebyshev GCNs restrict graphs to have at most one edge between any pair of nodes. To this end, we propose a novel multigraph network that learns from multi-relational graphs. We model learned edges with abstract meaning and experiment with different ways to fuse the representations extracted from annotated and learned edges, achieving competitive results on a variety of chemical classification benchmarks.




1 Introduction

Convolutional Neural Networks (CNNs) have seen wide success in domains where data is restricted to a Euclidean space. These methods exploit properties such as stationarity of the data distributions, locality and a well-defined notion of translation, but cannot model data that is non-Euclidean in nature. Such structure is naturally present in many domains, such as chemistry, physics, social networks, transportation systems, and 3D geometry, and can be expressed by graphs bronstein2017geometric ; hamilton2017representation . By defining an operation on graphs analogous to convolution, Graph Convolutional Networks (GCNs) have extended CNNs to graph-based data. The earliest methods performed convolution in the spectral domain bruna2013spectral , but subsequent work has proposed generalizations of convolution in the spatial domain. There have been multiple successful applications of GCNs to node classification velickovic2017graph and link prediction schlichtkrull2018modeling , whereas we target graph classification similarly to simonovsky2017dynamic .

Our focus is on multigraphs, i.e. graphs that are permitted to have multiple edges between the same pair of nodes. Multigraphs are important in many domains, such as chemistry and physics. The challenge of generalizing convolution to graphs and multigraphs is to have anisotropic convolution kernels (such as edge detectors). Anisotropic models, such as MoNet monti2017geometric and SplineCNN fey2018splinecnn , rely on coordinate structure and work well for vision tasks, but are suboptimal for non-visual graph problems. Other general models exist gilmer2017neural ; battaglia2018relational , but making them efficient for a variety of tasks conflicts with the “no free lunch theorem”.

Compared to non-spectral GCNs, spectral models have filters with more global support, which is important for capturing complex relationships. We rely on Chebyshev GCNs (ChebNet) defferrard2016convolutional , which enjoy explicit control of the receptive field size. Even though ChebNet was originally derived from spectral methods bruna2013spectral , it does not suffer from their main shortcoming: sensitivity of learned filters to graph size and structure.

Contributions: We propose a scalable spectral GCN that learns from multigraphs by capturing multi-relational graph paths as well as multiplicative and additive interactions to reduce model complexity and learn richer representations. We also learn new abstract relationships between graph nodes, beyond the ones annotated in the datasets. To our knowledge, we are the first to demonstrate that spectral methods can efficiently solve problems with variable graph size and structure, where this kind of method is generally believed not to perform well.

2 Multigraph Convolution

While we provide the background to understand our model, a review of spectral graph methods is beyond the scope of this paper. Section 6.1 of the Appendix reviews spectral graph convolution.

2.1 Approximate spectral graph convolution

We consider an undirected, possibly disconnected, graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes, $\mathcal{V}$, and $|\mathcal{E}|$ edges, $\mathcal{E}$, having values in range $[0, 1]$. Nodes usually represent specific semantic concepts such as atoms in a chemical compound or users in a social network. Nodes can also denote abstract blocks of information with common properties, such as superpixels in images. Edges define the relationships between nodes and the scope over which node effects may propagate.

In spectral graph convolution bruna2013spectral , the filter is defined on the entire input space. Although this makes filters global, which helps to capture complex relationships, it is also desirable to have local support, since the data often have local structure and since we want to learn filters independent of the input size to make the model scalable.

To address this issue, we can model this filter as a function of eigenvalues $\Lambda$ (which is assumed to be constant) of the normalized symmetric graph Laplacian $L$: $g_\theta(\Lambda)$. We can then approximate it as a sum of $K$ terms using the Chebyshev expansion, where each term contains powers $\Lambda^k$. Finally, we apply the property of eigendecomposition:

$L^k = (U \Lambda U^T)^k = U \Lambda^k U^T$. (1)

By combining this property with the Chebyshev expansion of $g_\theta(\Lambda)$, we exclude eigenvectors $U$, which are often infeasible to compute, from spectral graph convolution, and instead express the convolution as a function of the graph Laplacian $L$. In general, for the input $X \in \mathbb{R}^{N \times X_{in}}$ with $N$ nodes and $X_{in}$-dimensional features in each node, the approximate convolution is defined as:

$Y = \bar{X} \Theta$, (2)

where $\bar{X} \in \mathbb{R}^{N \times X_{in} K}$ are features projected onto the Chebyshev basis $T_k(\tilde{L})$ and concatenated for all orders $k \in [0, K-1]$ and $\Theta \in \mathbb{R}^{X_{in} K \times X_{out}}$ are trainable weights, where $\tilde{L} = L - I$.

This approximation scheme was proposed in defferrard2016convolutional , and Eq. 2 defines the convolutional layer in the Chebyshev GCN (ChebNet), which is the basis for our method. Convolution is an essential computational block in graph networks, since it permits the gradual aggregation of information from neighboring nodes. By stacking the operator in Eq. 2, we capture increasingly larger neighborhoods and learn complex relationships in graph-structured data.
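The layer in Eq. 2 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes a dense adjacency matrix, the rescaled Laplacian $\tilde{L} = -D^{-1/2} A D^{-1/2}$ from the Appendix, and $K$ Chebyshev terms of orders $0, \dots, K-1$. The function names `rescaled_laplacian`, `chebyshev_basis` and `cheb_conv` are hypothetical.

```python
import numpy as np


def rescaled_laplacian(A):
    """Return L~ = L - I = -D^{-1/2} A D^{-1/2} (lambda_max fixed to 2, no self-loops)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    connected = deg > 0
    d_inv_sqrt[connected] = deg[connected] ** -0.5  # isolated nodes keep 0
    return -(d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])


def chebyshev_basis(L_tilde, X, K):
    """Stack T_0(L~)X, ..., T_{K-1}(L~)X along the feature axis: (N, X_in * K)."""
    terms = [X]
    if K > 1:
        terms.append(L_tilde @ X)
    for _ in range(2, K):
        # Recurrence T_k(z) = 2 z T_{k-1}(z) - T_{k-2}(z), applied directly to features.
        terms.append(2.0 * (L_tilde @ terms[-1]) - terms[-2])
    return np.concatenate(terms, axis=1)


def cheb_conv(L_tilde, X, Theta, K):
    """Approximate spectral convolution Y = X_bar @ Theta (Eq. 2)."""
    return chebyshev_basis(L_tilde, X, K) @ Theta


# Toy usage: a 4-node path graph, X_in = 3 input features, X_out = 5, K = 3.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
N, X_in, X_out, K = 4, 3, 5, 3
X = rng.normal(size=(N, X_in))
Theta = rng.normal(size=(X_in * K, X_out))
Y = cheb_conv(rescaled_laplacian(A), X, Theta, K)
print(Y.shape)
```

The recurrence is applied to the features rather than to the matrix powers, so with a sparse Laplacian each extra order costs only one additional sparse matrix product.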

2.2 Graphs with variable structure and size

The approximate spectral graph convolution (Eq. 2) enforces spatial locality of the filters by controlling the order $K$ of the Chebyshev polynomial. Importantly, it reduces the computational complexity of spectral convolution from $O(N^2)$ to $O(K|\mathcal{E}|)$, making it much faster in practice, assuming the graph is sparsely connected and sparse matrix multiplication is implemented. In this work, we observe an important byproduct of this scheme: learned filters become less sensitive to changes in graph structure and size because the eigenvectors are excluded from spectral convolution, so that learned filters are not tied to $U$.

(a) ENZYMES (b) MUTAG (c) NCI1
Figure 1: Histograms of eigenvalues of the rescaled graph Laplacian $\tilde{L}$ for the (a) ENZYMES, (b) MUTAG and (c) NCI1 datasets. Due to the property of eigendecomposition ($\tilde{L}^k = U \tilde{\Lambda}^k U^T$) the distribution of eigenvalues shrinks when we take powers of $\tilde{L}$ to compute the approximate spectral graph convolution (Eq. 2).

The only assumption that still makes a trainable filter $g_\theta$ sensitive to graph structure is that we model it as a function of eigenvalues, $g_\theta(\Lambda)$. However, the distribution of eigenvalues of the normalized Laplacian is concentrated in a limited range, making it a weaker dependency on graphs than the spectral convolution via eigenvectors, so that learned filters generalize well to new graphs. Moreover, since we use powers of $\tilde{L}$ in performing convolution (Eq. 2), the distribution of eigenvalues further contracts due to exponentiation of the middle term on the RHS of Eq. 1. We believe that this effect accounts for the robustness of learned filters to changes in graph size or structure (Figure 1).
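The contraction argument can be checked numerically. The sketch below uses a random graph as a hypothetical stand-in for a molecule (it is not one of the benchmark datasets), builds the rescaled Laplacian $-D^{-1/2} A D^{-1/2}$, and verifies that its eigenvalues lie in $[-1, 1]$, so that their magnitudes can only shrink when powers are taken.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
# Random undirected graph without self-loops (illustrative only).
upper = np.triu(rng.random((N, N)) < 0.15, k=1).astype(float)
A = upper + upper.T

deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1.0)), 0.0)
L_tilde = -(d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])

lam, U = np.linalg.eigh(L_tilde)  # eigenvalues of the rescaled Laplacian

mean_abs = []
for k in range(1, 5):
    L_k = np.linalg.matrix_power(L_tilde, k)
    # Eigendecomposition property: L~^k = U diag(lam^k) U^T
    assert np.allclose(L_k, (U * lam**k) @ U.T)
    mean_abs.append(np.abs(lam**k).mean())

print(np.round(mean_abs, 3))  # average eigenvalue magnitude per power k
```

Because every eigenvalue has magnitude at most one, the average magnitude is non-increasing in $k$, which is the shrinking effect described for Figure 1.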

2.3 Graphs with multiple relation types

In the approximate spectral graph convolution (Eq. 2), the graph Laplacian $\tilde{L}$ encodes a single relation type between nodes. Yet, a graph may describe many types of distinct relations. In this section, we address this limitation by extending Eq. 2 to a multigraph, i.e. a graph with multiple ($R$) edges (relations) between the same nodes encoded as a set of graph Laplacians $\{\tilde{L}^{(r)}\}_{r=1}^{R}$, where $R$ is an upper bound on the number of edges per dyad. Extensions to a multigraph can also be applied to early spectral models bruna2013spectral but, since ChebNet was shown to be superior in downstream tasks, we choose to focus on the latter model.

Two dimensional Chebyshev polynomial.

The Chebyshev polynomial used in Eq. 2 (see Section 6.1 in Appendix for detail) can be extended for two variables (relations in our case) similarly to bilinear models, e.g. as in omar2010two :

$T_{ij}(\tilde{L}^{(1)}, \tilde{L}^{(2)}) = T_i(\tilde{L}^{(1)})\, T_j(\tilde{L}^{(2)}), \quad i, j = 0, \dots, K-1,$ (3)

and, analogously, for more variables. For $R = 2$, the convolution is then defined as:

$Y = [\bar{X}_{0,0}, \bar{X}_{0,1}, \dots, \bar{X}_{K-1,K-1}]\,\Theta,$ (4)

where $\bar{X}_{i,j} = T_{ij}(\tilde{L}^{(1)}, \tilde{L}^{(2)})\, X$. In this case, we allow the model to leverage graph paths consisting of multiple relation types (Figure 2). This flexibility, however, comes at a great computational cost, which is prohibitive for a large number of relations $R$ or large order $K$ due to exponential growth of the number of parameters: $O(K^R)$. Moreover, as we demonstrate in our experiments, such multi-relational paths do not necessarily lead to better performance.
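A small sketch of the two-dimensional Chebyshev basis for $R = 2$ may help to see both the multi-relational paths and the parameter growth. It assumes dense, already rescaled Laplacians and the product form $T_i(\tilde{L}^{(1)}) T_j(\tilde{L}^{(2)})$; the helper names `cheb_polys` and `cheb2d_basis` are hypothetical.

```python
import numpy as np


def cheb_polys(L, K):
    """Return dense matrices [T_0(L), ..., T_{K-1}(L)]."""
    T = [np.eye(L.shape[0])]
    if K > 1:
        T.append(L.copy())
    for _ in range(2, K):
        T.append(2.0 * (L @ T[-1]) - T[-2])
    return T


def cheb2d_basis(L1, L2, X, K):
    """Concatenate T_i(L1) T_j(L2) X for all i, j in [0, K-1]: (N, X_in * K**2)."""
    T1, T2 = cheb_polys(L1, K), cheb_polys(L2, K)
    return np.concatenate([Ti @ (Tj @ X) for Ti in T1 for Tj in T2], axis=1)


# Three nodes: relation 1 links nodes 0-1, relation 2 links nodes 1-2
# (both matrices are the rescaled Laplacians of these single-edge graphs).
L1 = -np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
L2 = -np.array([[0., 0., 0.], [0., 0., 1.], [0., 1., 0.]])
X = np.eye(3)  # one-hot node features
K, X_out = 2, 4

X_bar = cheb2d_basis(L1, L2, X, K)
Theta = np.zeros((X_bar.shape[1], X_out))  # X_in * K**2 * X_out weights
print(X_bar.shape, Theta.size)

# The mixed term T_1(L1) T_1(L2) lets node 0 see node 2 through a path that
# uses both relation types; no single-relation 2-hop term can do that here.
mixed = L1 @ L2
print(mixed[0, 2], (L1 @ L1)[0, 2], (L2 @ L2)[0, 2])
```

The weight matrix has $X_{in} K^2 X_{out}$ entries in this two-relation case, which illustrates why the approach becomes expensive as $K$ or the number of relations grows.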

(a) (b)
Figure 2: Comparison of (a) the fusion method based on a two-dimensional (2d) Chebyshev polynomial (Eqs. 3 and 4) to (b) other proposed methods in the case of a 2-hop filter (a filter averaging features of nodes located two edges away from the filter center). Note that (a) can leverage multi-relational paths, so a filter centered at one node can access features of a node that is reachable only through edges of both types, which is not possible for the other methods (b). In this work, one edge type can denote annotated relations, while the other can denote learned ones (Eq. 7). We also allow for three and more relation types.

Multiplicative and additive fusion.

Motivated by multimodal fusion considered in the Visual Question Answering literature (e.g. kim2016hadamard ), we propose the multiplicative operator:

$Y = \left[ f_1(\bar{X}^{(1)}) \odot f_2(\bar{X}^{(2)}) \odot \dots \odot f_R(\bar{X}^{(R)}) \right] \Theta,$ (5)

where $f_r$ is a learnable differentiable transformation for relation type $r$ and $\bar{X}^{(r)}$ are features projected onto the Chebyshev basis $T_k(\tilde{L}^{(r)})$. In this case, node features interact in a multiplicative way. The advantage of this method is that it can learn a separate $f_r$ for each relation and has fewer trainable parameters, which helps prevent overfitting and is especially important for large $R$ and $K$. The element-wise multiplication in Eq. 5 can be replaced with summation to perform additive fusion.

Shared projections.

Another potential strength of the approach in Eq. 5 is that we can further decrease model complexity by sharing parameters of $f_r$ between the relation types, so that the total number of trainable parameters does not depend on the number of relations $R$. Despite these useful practical properties, as we demonstrate in the experiments, it is usually hard for a single shared $f$ to generalize between different relation types.
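The multiplicative, additive and shared variants can be sketched together. This is an illustration under explicit assumptions rather than the authors' code: each projection is taken to be a single dense layer with $C$ hidden units and a tanh activation (the activation is an assumption), the fused features are mapped to the output by a weight matrix of shape $C \times X_{out}$, and the names `project` and `fuse` are hypothetical.

```python
import numpy as np


def project(X_bar, W, b):
    """Projection of one relation: one dense layer; tanh is an assumed activation."""
    return np.tanh(X_bar @ W + b)


def fuse(X_bars, projections, Theta, mode="multiply"):
    """Fuse per-relation Chebyshev features, then apply the output weights Theta.

    X_bars:      list of R arrays of shape (N, X_in * K), one per relation type
    projections: list of R (W, b) pairs; pass the same pair R times to share it
    """
    hidden = [project(Xb, W, b) for Xb, (W, b) in zip(X_bars, projections)]
    fused = hidden[0]
    for h in hidden[1:]:
        fused = fused * h if mode == "multiply" else fused + h
    return fused @ Theta


rng = np.random.default_rng(0)
N, X_in, K, C, X_out, R = 6, 3, 4, 8, 5, 2
X_bars = [rng.normal(size=(N, X_in * K)) for _ in range(R)]
Theta = rng.normal(size=(C, X_out))

separate = [(rng.normal(size=(X_in * K, C)), np.zeros(C)) for _ in range(R)]
shared = [separate[0]] * R  # parameter count no longer depends on R

Y_mul = fuse(X_bars, separate, Theta, mode="multiply")
Y_sum = fuse(X_bars, separate, Theta, mode="sum")
Y_shared = fuse(X_bars, shared, Theta, mode="multiply")
print(Y_mul.shape, Y_sum.shape, Y_shared.shape)
```

Switching `mode` changes only the element-wise operator, so the multiplicative and additive variants have the same number of parameters under these assumptions.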

Concatenating edge features.

A more straightforward approach is to concatenate features for all relation types and learn a single matrix of weights $\Theta \in \mathbb{R}^{X_{in} K R \times X_{out}}$:

$Y = \left[ \bar{X}^{(1)}, \bar{X}^{(2)}, \dots, \bar{X}^{(R)} \right] \Theta.$ (6)

This method, however, does not scale well for large $R$, since the dimensionality of $\Theta$ grows linearly with $R$. Note that even though multi-relational paths are not explicit in Eq. 5 and 6, for a multilayer network, relation types will still communicate through node features. In Figure 2, an intermediate node will contain features of its neighbor after the first convolutional layer, so that in the second layer a filter centered at another neighbor of that intermediate node will have access to those features through it. Compared to the 2d polynomial convolution defined by Eq. 4, the concatenation-based, multiplicative and additive approaches require more layers to have a larger multi-relational receptive field.
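Concatenation-based fusion (Eq. 6) reduces to a single matrix product. The sketch below, with the hypothetical helper `concat_fusion`, also prints the size of the weight matrix for a few values of $R$ to make the linear growth visible.

```python
import numpy as np


def concat_fusion(X_bars, Theta):
    """Y = [X_bar^(1), ..., X_bar^(R)] Theta with Theta of shape (X_in*K*R, X_out)."""
    return np.concatenate(X_bars, axis=1) @ Theta


rng = np.random.default_rng(0)
N, X_in, K, X_out = 6, 3, 4, 5
for R in (1, 2, 3):
    X_bars = [rng.normal(size=(N, X_in * K)) for _ in range(R)]
    Theta = rng.normal(size=(X_in * K * R, X_out))
    Y = concat_fusion(X_bars, Theta)
    print(R, Y.shape, Theta.size)  # Theta.size grows linearly with R
```

Splitting the weight matrix into $R$ row blocks shows that this layer is a sum of independent per-relation convolutions, which is why relation types only interact across layers.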

3 Multigraph Convolutional Networks

A frequent assumption of current GCNs is that there is at most one edge between any pair of nodes in a graph. This restriction is usually implied by datasets with such structure, so that in many datasets, graphs are annotated with the single most important relation type, for example, whether two atoms in a molecule are bonded wale2008comparison ; duvenaud2015convolutional . Meanwhile, data is often complex and nodes tend to have multiple relationships of different semantic, physical, or abstract meanings. Therefore, we argue that there could be other relationships captured by relaxing this restriction and allowing for multiple kinds of edges, beyond those annotated in the data.

3.1 Learning edges

Prior work (e.g. schlichtkrull2018modeling ; bordes2013translating ) proposed methods to learn from multiple edges, but similarly to the methods using a single edge type kipf2016semi , they leveraged only predefined (annotated) edges in the data. We devise a more flexible model, which, in addition to learning from an arbitrary number of predefined relations between nodes (see Section 2.3), learns abstract edges jointly with a GCN. We propose to learn a new edge $e_{ij}$ between any pair of nodes $v_i$ and $v_j$ with features $X_i$ and $X_j$ using a trainable similarity function:

$e_{ij} = \frac{\exp(f_{edge}(X_i, X_j))}{\sum_{k \in \mathcal{V}} \exp(f_{edge}(X_i, X_k))},$ (7)

where the softmax is used to enforce sparse connections and $f_{edge}$ can be any differentiable function, such as the multilayer perceptron used in our work. This idea is similar to henaff2015deep , built on the early spectral convolution model bruna2013spectral , which learned an adjacency matrix, but targeted classification tasks for non graph-structured data (e.g. document classification, with each document represented as a feature vector). Moreover, we learn this matrix jointly with a more recent graph classification model defferrard2016convolutional and, additionally, efficiently fuse predefined and learned relations. Eq. 7 is also similar to that of velickovic2017graph , which used this functional form to predict an attention coefficient for some existing edge $e_{ij}$. The attention model can only strengthen or weaken some existing relations, but cannot form new relations. We present a more general model that makes it possible to connect previously disconnected nodes and form new abstract relations. To enforce symmetry of the predicted edges we compute an average: $\hat{e}_{ij} = (e_{ij} + e_{ji})/2$.
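The edge-learning step can be sketched as follows: a two-layer perceptron scores every ordered pair of concatenated node features, a row-wise softmax normalizes the scores, and the result is averaged with its transpose. The ReLU hidden activation, the inclusion of self-pairs in the softmax, the random weights, and the function names `edge_scores` and `learned_adjacency` are assumptions made for illustration.

```python
import numpy as np


def edge_scores(X, W1, b1, W2, b2):
    """Score every ordered node pair with a two-layer MLP on concatenated features."""
    N = X.shape[0]
    Xi = np.repeat(X[:, None, :], N, axis=1)   # (N, N, F): features of node i
    Xj = np.repeat(X[None, :, :], N, axis=0)   # (N, N, F): features of node j
    pairs = np.concatenate([Xi, Xj], axis=-1)  # (N, N, 2F)
    hidden = np.maximum(pairs @ W1 + b1, 0.0)  # ReLU is an assumed nonlinearity
    return (hidden @ W2 + b2)[..., 0]          # (N, N)


def learned_adjacency(X, W1, b1, W2, b2):
    """Row-wise softmax over candidate neighbors, then symmetrize by averaging."""
    s = edge_scores(X, W1, b1, W2, b2)
    s = s - s.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(s)
    e = e / e.sum(axis=1, keepdims=True)
    return 0.5 * (e + e.T)


rng = np.random.default_rng(0)
N, F, H = 5, 4, 128  # 128 hidden units, as in the experimental setup
X = rng.normal(size=(N, F))
W1, b1 = 0.1 * rng.normal(size=(2 * F, H)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(H, 1)), np.zeros(1)

A_learned = learned_adjacency(X, W1, b1, W2, b2)
print(A_learned.shape)
```

The resulting matrix is dense with values in $[0, 1]$, so it can connect nodes that share no annotated edge, and it can be turned into a second rescaled Laplacian for any of the fusion methods in Section 2.3.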

3.2 Layer pooling versus global pooling

Inspired by convolutional networks, previous works bruna2013spectral ; defferrard2016convolutional ; monti2017geometric ; simonovsky2017dynamic ; fey2018splinecnn built an analogy of pooling layers in graphs, for example, using the Graclus clustering algorithm dhillon2007weighted . In CNNs, pooling is an effective way to reduce memory and computation, particularly for large inputs. It also provides additional robustness to local deformations and leads to faster growth of receptive fields. However, we can build a convolutional network without any pooling layers with similar performance in a downstream task springenberg2014striving . It will just be relatively slow, since pooling is extremely cheap on regular grids, such as images. In graph classification tasks, the input dimensionality, which corresponds to the number of nodes $N$, is often very small and the benefits of pooling are less clear. Graph pooling, such as in dhillon2007weighted , is also computationally intensive since we need to run the clustering algorithm for each training example independently, which limits the scale of problems we can address. Aiming to simplify the model while maintaining classification accuracy, we exclude pooling layers between convolutional layers and perform global maximum pooling (GMP) over nodes following the last convolutional layer. This fixes the size of the penultimate feature vector regardless of the number of nodes (Figure 3).
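Global max pooling over nodes is a one-line operation. The sketch below uses random node features as placeholders and the hypothetical helper `global_max_pool` to show that graphs with different numbers of nodes yield feature vectors of the same length.

```python
import numpy as np


def global_max_pool(H):
    """Max over the node axis: (N, F) -> (F,), independent of N."""
    return H.max(axis=0)


rng = np.random.default_rng(0)
small_graph = rng.normal(size=(7, 128))    # 7 nodes, 128 features per node
large_graph = rng.normal(size=(111, 128))  # 111 nodes, same feature size
print(global_max_pool(small_graph).shape, global_max_pool(large_graph).shape)
```

The maximum is also invariant to the order of nodes, so the pooled vector does not depend on how a graph happens to be indexed.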

4 Experiments

4.1 Dataset details

We evaluate our model on five chemical graph classification datasets frequently used in previous work: NCI1 and NCI109 wale2008comparison , MUTAG debnath1991structure , ENZYMES schomburg2004brenda , and PROTEINS borgwardt2005protein . For each dataset, there is a set of graphs with an arbitrary number of nodes $N$ and undirected binary edges of a single type ($R = 1$) and each graph has a single, categorical label that is to be predicted. Dataset statistics are presented in Table 2 of the Appendix.

Every graph represents some chemical compound labeled according to its functional properties. In NCI1, NCI109 and MUTAG, edges correspond to atom bonds (types of bonds) and vertices - to atom properties; in ENZYMES, edges are formed based on spatial distance (edges connect nodes if those are neighbors along the amino acids sequence or if they are neighbors in space within the protein structure borgwardt2005protein ); in PROTEINS, edges are similarly formed based on spatial distance between amino acids in proteins. Since edges in these datasets are not directly related to the features of nodes they connect, we expect that learned edges will enrich graph structure and improve graph classification.

Node features are discrete in these datasets and represented as one-hot vectors of length $X_{in}$. We do not use any additional node or edge attributes available for some of these datasets.

These datasets vary in the number of graphs (188 - 4127), class labels (2 - 6) and the number of nodes in a graph (2 - 620) and, thereby, represent a comprehensive benchmark for our method. We follow the standard approach to evaluation shervashidze2011weisfeiler ; yanardag2015deep and perform 10-fold cross-validation on these datasets. To minimize any random effects, we repeat experiments 10 times and report average classification accuracies together with standard deviations.

4.2 Architectural details and experimental setup

Figure 3: Graph classification pipeline. Each convolutional layer in our model takes the graph and returns a graph with the same nodes and edges. Node features become increasingly global after each subsequent layer as the receptive field increases, while edges are propagated without changes. As a result, after several graph convolutional layers, each node in the graph contains information about its neighbors and the entire graph. By pooling over nodes we summarize the information collected by each node. Fully-connected layers follow global pooling to perform classification. Dashed orange edges denote connections learned as described in Section 3.1.

In all experiments, we train a ChebNet with three graph convolutional layers followed by global max pooling (GMP) and 2 fully-connected layers (Figure 3). Batch normalization (BN) and the ReLU activation are added after each layer, whereas dropout is added only before the fully-connected layers. Projections $f_r$ in Eq. 5 are modeled by a single-layer neural network with $C$ hidden units followed by an activation function. The edge prediction function (see Eq. 7, Section 3.1) is a two-layer neural network with 128 hidden units (32 for PROTEINS), which acts on concatenated node features. Detailed network architectures are presented in Table 2 of the Appendix.

We train all models using the Adam optimizer kingma2014adam with a learning rate of 0.001, weight decay of 0.0001, and batch size of 32. The learning rate is decayed after 25, 35, and 45 epochs, and the models are trained for 50 epochs as in simonovsky2017dynamic . We run experiments for different fusion methods (Section 2.3) and Chebyshev orders $K$ in the range from 2 to 6 (Section 2.1) and report the best results in Table 1.

Model | NCI1 | NCI109 | MUTAG | ENZYMES | PROTEINS
WL shervashidze2011weisfeiler | 84.6 ± 0.4 | 84.5 ± 0.2 | 83.8 ± 1.5 | 59.1 ± 1.1 | -
WL-OA kriege2016valid | 86.1 ± 0.2 | 86.3 ± 0.2 | 84.5 ± 1.7 | 59.9 ± 1.1 | 76.4 ± 0.4
structure2vec dai2016discriminative | 83.7 | 82.2 | 88.3 | 61.1 | -
DGK yanardag2015deep | 80.3 ± 0.5 | 80.3 ± 0.3 | 87.4 ± 2.7 | 53.4 ± 0.9 | 75.7 ± 0.5
PSCN niepert2016learning | 78.6 ± 1.9 | - | 92.6 ± 4.2 | - | 75.9 ± 2.8
ECC simonovsky2017dynamic | 83.8 | 82.1 | 88.3 | 53.5 | -
DGCNN zhang2018end | 74.4 ± 0.5 | - | 85.8 ± 1.7 | - | 76.3 ± 0.2
Graph U-Net cangea2018towards | - | - | - | 64.2 | 75.5
DiffPool ying2018hierarchical | - | - | - | 62.5 | 76.3
MoNet monti2017geometric - ours* | 69.8 ± 0.2 | 70.0 ± 0.3 | 84.2 ± 1.2 | 36.4 ± 1.2 | 71.9 ± 1.2
GCN kipf2016semi - ours* | 75.8 ± 0.7 | 73.4 ± 0.4 | 76.5 ± 1.4 | 40.7 ± 1.8 | 74.3 ± 0.5
ChebNet defferrard2016convolutional - ours* | 83.1 ± 0.4 | 82.1 ± 0.2 | 84.4 ± 1.6 | 58.0 ± 1.4 | 75.5 ± 0.4
Multigraph ChebNet | 83.4 ± 0.4 | 82.0 ± 0.3 | 89.1 ± 1.4 | 61.7 ± 1.3 | 76.5 ± 0.4
Table 1: Chemical graph classification results (average accuracy and standard deviation in %). A dash denotes a result that is not reported. Multigraph ChebNet obtains better results by leveraging two types of edges: annotated and learned, whereas all other models use only annotated edges. *We implemented MoNet, GCN and ChebNet. To make a fair comparison to Multigraph ChebNet, we use the same network architectures, batch normalization and global max pooling. For MoNet, coordinates are defined using node degrees as in monti2017geometric . The top results across all methods are WL-OA on NCI1 and NCI109, PSCN on MUTAG, Graph U-Net on ENZYMES and Multigraph ChebNet on PROTEINS.

4.3 Results

Previous works typically show strong performance on one or two datasets out of five that we use (Table 1). In contrast, the Multigraph ChebNet, leveraging two relation types (annotated and learned, see Section 3.1), shows high accuracy across all datasets. On PROTEINS we outperform all previous methods, while on ENZYMES two recent works based on differentiable pooling ying2018hierarchical ; cangea2018towards are better, however it is difficult to compare to their results without the standard deviation of accuracies. We also obtain competitive accuracy on NCI1 outperforming DGK yanardag2015deep , PSCN niepert2016learning , and DGCNN zhang2018end . Importantly, the Multigraph ChebNet with two edge types, i.e. predefined dataset annotations and the learned edges (Section 3.1) consistently outperforms the baseline ChebNet with a single edge, which shows efficacy of our approach and demonstrates the complementary nature of predefined and learned edges. Lower results on NCI1 and NCI109 can be explained by the fact that the node features in the graphs of these datasets are imbalanced with some features appearing only a few times in the dataset. This is undesirable for our method, which learns new edges based on features and the model can predict random values for unseen features. On MUTAG we surpass all but one method niepert2016learning . But in this case the dataset is tiny, consisting of 188 graphs and the margin from the top method is not statistically significant.

Evaluation of edge fusion methods.

We train a model using each of the edge fusion methods proposed in Section 2.3 and report the summary of results in Figure 4. We count the number of times each method outperforms the others, treating all 10 folds independently. As expected, graph convolution based on the two-dimensional Chebyshev polynomial is better for lower orders $K$, since it exploits multi-relational graph paths, effectively increasing the receptive field of filters. However, for larger $K$, the model complexity becomes too high due to quadratic growth of the number of parameters and performance degrades. Sharing weights for multiplicative or additive fusion generally drops performance, with a few exceptions in the multiplicative case. This implies that predefined and learned edges are of a different nature. It would be interesting to validate these fusion methods on a larger number of relation types.

Fusion method # of parameters
Single edge* (Eq. 2)
Concat (Eq. 6)
2d Cheb (Eq. 4)
Multiply (Eq. 5)
Sum (Eq. 5)
M-shared (Eq. 5)
S-shared (Eq. 5)
Figure 4: Comparison of edge fusion methods for 10 folds. We observe that some methods perform well for lower order $K$, such as 2d Chebyshev convolution winning in 18/50 cases at a low order, while others perform better for higher $K$, such as Multiply-shared. Multiply generally performs well across different $K$. All fusion methods, except for Sum and Sum-shared, consistently outperform the Single-edge baseline. We also show the number of parameters in $\Theta$ for the Chebyshev convolution layer depending on the number of input features $X_{in}$, number of output features $X_{out}$, number of relation types $R$, order $K$ and some constant $C$ (number of hidden units in a projection layer in Eq. 5). *Single-edge denotes using only annotated edges. All other methods additionally use the second edge, learned based on node features.

Speed comparison.

We compare forward pass speed of the proposed method to the baseline ChebNet, MoNet monti2017geometric and GCN kipf2016semi (Figure 5). Analogously to kipf2016semi , we generate random graphs with nodes and edges. ChebNet with 2 edge types (Multigraph ChebNet) is on average two times slower than the baseline ChebNet with a single annotated edge. MoNet is in turn two times slower than Multigraph ChebNet. Multigraph ChebNet with 2d edge fusion is the slowest due to exponential growth of parameters, while GCN is the fastest, although the gap with the baseline ChebNet is small. Therefore, we believe that Multigraph ChebNet with certain edge fusion methods provides a relatively fast, scalable and accurate model.

Figure 5: Speed comparison of the baseline ChebNet, ChebNet with two edge types (Multigraph ChebNet), MoNet monti2017geometric and GCN kipf2016semi . MoNet-N refers to MoNet with filters.

5 Related work and Discussion

Our method relies on a fast approximate spectral graph convolution known as ChebNet defferrard2016convolutional , which was designed for graph classification. A simplified and faster version of this model, Graph Convolutional Networks (GCN) kipf2016semi , which is practically equivalent to a ChebNet restricted to a first-order approximation, has shown impressive node classification performance on citation and knowledge graph datasets in the transductive learning setting. In all our experiments, we noticed that using more global filters (with larger $K$) is important (Table 1). Other recent works hamilton2017inductive ; velickovic2017graph also focus on node classification and, therefore, are not empirically compared to in this work.

Recent work of ying2018hierarchical proposed a differentiable alternative to clustering-based graph pooling, showing strong results in graph classification tasks, but at the high computational cost. To alleviate this, a more scalable approach based on dropping nodes graphunet2018 ; cangea2018towards was introduced and can be integrated with our method to further improve results.

Closely related to our work, monti2017geometric formulated the generalized graph convolution model (MoNet) based on a trainable transformation to pseudo-coordinates, which led to learning anisotropic kernels and excellent results in visual tasks. However, in non-visual tasks, when coordinates are not naturally defined, the performance is worse (Table 1). Notably, the computational cost (both memory and speed) of MoNet is higher than for ChebNet due to the patch operator in monti2017geometric (Eq. (9)-(11)) (Figure 5). The argument in favor of MoNet against ChebNet was the sensitivity of spectral convolution methods, including ChebNet, to changes in graph size and structure. We challenge this argument and show superior performance on chemical graph classification datasets. SplineCNN fey2018splinecnn is similar to MoNet and is good at classifying both graphs and nodes, but it is also based on pseudo-coordinates and, therefore, potentially has the same shortcoming as MoNet. Its performance on general graph classification problems where coordinates are not well defined is therefore expected to be inferior.

Another family of methods based on kernels shervashidze2011weisfeiler ; kriege2016valid shows strong performance on chemical datasets, but their application is limited to small-scale graph problems with discrete node features. Scalable extensions of kernel methods to graphs with continuous features were proposed niepert2016learning ; yanardag2015deep , but they showed weaker results.

6 Conclusion

In this work, we address several limitations of current graph convolutional networks and show competitive graph classification results on a number of chemical datasets. First, we revisit the spectral graph convolution model based on the Chebyshev polynomial, commonly believed to inherit shortcomings of earlier spectral methods, and demonstrate its ability to learn from graphs of arbitrary size and structure. Second, we design and study edge fusion methods for multi-relational graphs, and show the importance of validating these methods for each task to achieve optimal performance. Third, we propose a way to learn new edges in a graph jointly with a graph classification model. Our results show that the learned edges are complementary to edges already annotated, providing a significant gain in accuracy.


This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation.



6.1 Overview of spectral graph convolution and its approximation

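Following the notation of [11], spectral convolution on a graph $\mathcal{G}$ having $N$ nodes is defined analogously to convolution in the Fourier domain (the convolution theorem) for some one-dimensional features over nodes $X \in \mathbb{R}^{N}$ and filter $g \in \mathbb{R}^{N}$ as [3, 1]:

$X * g = U \left( (U^T X) \odot (U^T g) \right) = U\, \mathrm{diag}(U^T g)\, U^T X,$ (8)

where $U$ are the eigenvectors of the normalized symmetric graph Laplacian, $L = I - D^{-1/2} A D^{-1/2}$, where $A$ is an adjacency matrix of the graph $\mathcal{G}$ and $D$ is the diagonal matrix of node degrees. $L = U \Lambda U^T$ follows from the definition of eigenvectors, where $\Lambda$ is a diagonal matrix of eigenvalues. The operator $\odot$ denotes the Hadamard product (element-wise multiplication), and $\mathrm{diag}(U^T g)$ is a diagonal matrix with the elements of $U^T g$ on the diagonal.

The spectral convolution in (8) can be approximated using the Chebyshev expansion, where $T_k(z) = 2 z T_{k-1}(z) - T_{k-2}(z)$ with $T_0(z) = 1$ and $T_1(z) = z$ (i.e. terms contain powers $z^k$) and the property of eigendecomposition:

$L^k = (U \Lambda U^T)^k = U \Lambda^k U^T.$ (9)

Assuming eigenvalues $\Lambda$ are fixed constants, the filter can be represented as a function of eigenvalues $g_\theta(\Lambda)$, such that (8) becomes:

$X * g_\theta = U g_\theta(\Lambda) U^T X.$ (10)

Filter $g_\theta$ can then be approximated as a Chebyshev polynomial of degree $K-1$ (a weighted sum of $K$ terms). Substituting the approximated $g_\theta$ into Eq. 10 and exploiting Eq. 9, the approximate spectral convolution takes the form of (see [11, 17] for further analysis and [35] for derivations):

$X * g_\theta \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) X = \bar{X} \theta,$ (11)

where $\tilde{L} = 2L/\lambda_{max} - I$ is a rescaled graph Laplacian with $\lambda_{max}$ as the largest eigenvalue of $L$, $\bar{X} \in \mathbb{R}^{N \times K}$ are projections of input features onto the Chebyshev basis $T_k(\tilde{L})$ and $\theta \in \mathbb{R}^{K}$ are learnable weights shared across nodes. In this work, we further simplify the computation and fix $\lambda_{max} = 2$ ($\lambda_{max}$ varies from graph to graph), so that $\tilde{L} = L - I = -D^{-1/2} A D^{-1/2}$, and assume no loops in a graph. $\tilde{L}$ has the same eigenvectors as $L$, but its eigenvalues are $\tilde{\Lambda} = \Lambda - I$.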
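The equivalence between the spectral form and its Chebyshev approximation can be verified numerically when the filter is itself a Chebyshev polynomial of the rescaled eigenvalues. The sketch below uses a random graph and random coefficients (both purely illustrative) and compares the route through the eigenvectors with the recurrence that only needs the rescaled Laplacian.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(0)
N, K = 12, 4
upper = np.triu(rng.random((N, N)) < 0.3, k=1).astype(float)
A = upper + upper.T  # undirected graph, no self-loops
deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1.0)), 0.0)
L_tilde = -(d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])

X = rng.normal(size=N)      # one-dimensional node features
theta = rng.normal(size=K)  # K filter coefficients

# Spectral route: U g(Lambda~) U^T X with g(z) = sum_k theta_k T_k(z).
lam, U = np.linalg.eigh(L_tilde)
y_spectral = U @ (cheb.chebval(lam, theta) * (U.T @ X))

# Chebyshev route: recurrence on L~ only, no eigenvectors needed.
terms = [X, L_tilde @ X]
for _ in range(2, K):
    terms.append(2.0 * (L_tilde @ terms[-1]) - terms[-2])
y_cheb = sum(coef * term for coef, term in zip(theta, terms))

print(np.allclose(y_spectral, y_cheb))
```

Both routes agree up to floating-point error, while the second one avoids the eigendecomposition entirely.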

6.2 Dataset statistics and network architectures

Dataset | # graphs | N_min | N_max | N_avg | X_in | Architecture
NCI1 | 4110 | 3 | 111 | 29.87 | 37 | GC32-GC64-GC128-D0.1-FC256-D0.1-FC2
NCI109 | 4127 | 4 | 111 | 29.68 | 38 | GC32-GC64-GC128-D0.1-FC256-D0.1-FC2
MUTAG | 188 | 10 | 28 | 17.93 | 7 | GC32-GC32-GC32-D0.1-FC96-D0.1-FC2
ENZYMES | 600 | 2 | 126 | 32.63 | 3 | GC32-GC64-GC512-D0.1-FC256-D0.1-FC6
PROTEINS | 1113 | 4 | 620 | 39.06 | 3 | GC32-GC32-GC32-D0.1-FC96-D0.1-FC2
Table 2: Dataset statistics and graph network architectures. These statistics can also be found in [36] along with the datasets themselves. N - number of nodes in a graph (minimum, maximum and average per dataset), X_in - input dimensionality. GC - graph convolution layer, FC - fully connected layer, D - dropout.