Weisfeiler-Lehman Embedding for Molecular Graph Neural Networks

06/12/2020 · Katsuhiko Ishiguro, et al. · Preferred Infrastructure

A graph neural network (GNN) is a good choice for predicting the chemical properties of molecules. Compared with other deep networks, however, the current performance of GNNs is limited owing to the "curse of depth." Inspired by long-established feature engineering in the field of chemistry, we expand the atom representation using Weisfeiler-Lehman (WL) embedding, which is designed to capture the local atomic patterns that dominate the chemical properties of a molecule. In terms of representability, we show that WL embedding can replace the first two layers of a ReLU GNN (a normal embedding and a hidden GNN layer) with a smaller weight norm. We then demonstrate that WL embedding consistently improves the empirical performance over multiple GNN architectures and several molecular graph datasets.


1 Introduction

When searching for new drugs or materials, supervised learning problems with structured data are frequently encountered. One of the most straightforward task settings is to predict the chemical properties (e.g., the toxicity) of a molecular graph, which describes how atoms such as hydrogen and oxygen are connected through chemical bonds (Wu et al., 2018). Despite the tremendous number of different molecules, their properties often depend solely on local atomic patterns. For example, specific local patterns called functional groups in organic molecules are strong evidence for determining such chemical properties (Smith, 2020). Traditionally, this type of local information is incorporated as handcrafted features (Wiener, 1947; Todeschini and Consonni, 2009).

During the past few years, numerous studies have shown the benefit of applying a graph neural network (GNN) to molecular tasks (Duvenaud et al., 2015; Gilmer et al., 2017; Coley et al., 2019; Yang et al., 2019; Jin et al., 2018; De Cao and Kipf, 2018; Bradshaw et al., 2019). Currently, GNNs seem to be on track to achieve what convolutional neural networks (CNNs) accomplished for image tasks, namely, nearly eliminating the need for handcrafted image features (e.g., Krizhevsky et al., 2012). However, recent theoretical and empirical results suggest a different outcome. Although depth is an essential factor in representation learning (Bengio et al., 2013), a GNN incurs a "curse of depth", i.e., the performance degrades when we stack too many layers (Li et al., 2018; Xu et al., 2019; NT and Maehara, 2019; Oono and Suzuki, 2020). How this problem can be overcome with a limited number of layers remains an important issue.

In this study, we revisit the idea of designing molecular features for a GNN. As a key contribution, we propose Weisfeiler-Lehman (WL) embedding, a simple, architecture-free embedding method that explicitly leverages local atomic patterns. In contrast to a standard atomic embedding, which assigns a continuous vector to each single atom, WL embedding constructs an embedding vector by considering the neighboring atoms as well, from which we aim to extract subgraph information such as functional groups (Fig. 1). Although WL embedding may generate too many embedding vectors and cause overfitting, we can mitigate this problem by combining it with the normal embedding. In terms of representability, we show that WL embedding is equivalent to the normal embedding followed by a single hidden ReLU layer while reducing the model complexity characterized by the weight norm, which is a key factor for better trainability. We also demonstrate that WL embedding improves the prediction performance regardless of the molecular dataset or GNN architecture applied.

Figure 1: Overview of the proposed WL embedding. Left: A molecular graph. Each node is labeled by an atom type such as carbon (C) or oxygen (O). Middle Upper: Atomic embedding constructs an embedding vector for each single atom. Middle Lower: WL embedding constructs an embedding vector for each combination of an atom and its neighboring atoms. Right: The node embedding is fed into a GNN together with the graph edges. Note that hydrogen (H) atoms are not counted as nodes in practice, but here we explicitly use them for illustrative purposes.
Notation

Let $\mathbb{Z}_{\geq 0}$ be the set of nonnegative integers. Let $G = (V, E)$ be an undirected $\mathbb{Z}_{\geq 0}$-labeled multigraph, where $V$ is the set of nodes (i.e., atoms) and $E$ is the multiset of edges (i.e., chemical bonds¹) between nodes. Each node $v \in V$ is associated with a discrete label $l_v \in \mathbb{Z}_{\geq 0}$. We denote by $N(v)$ the neighboring nodes of $v$, where multiple edges are regarded as single edges. The empty set is denoted by $\emptyset$.

¹Bonds can have multiple types (single, double, ...), but bond types are not used in the embedding. After embedding, GNNs may employ the bond type information (Schlichtkrull et al., 2017; Busbridge et al., 2019).

2 Related Work

GNNs and Curse of Depth

After the seminal study of graph convolutional networks (GCNs) by Kipf and Welling (2017), various GNN architectures have been proposed. For example, Gilmer et al. (2017) proposed a family of GNNs called message passing neural networks (MPNNs) and showed that MPNNs achieve state-of-the-art performance on multiple molecular benchmark problems. Some researchers have proposed GNNs tailored to specific applications of molecular graphs (Bradshaw et al., 2019; Jin et al., 2019). For general applications, the gated graph neural network (GGNN) (Li et al., 2016) is a strong GRU-based GNN. The relational graph attention network (RelGAT) (Busbridge et al., 2019) is a representative non-local neural network (Wang et al., 2018), which uses multiple attention mechanisms over a set of edge types.

Along with their major successes, the limitations of GNNs have been gradually revealed. Li et al. (2018) reported an over-smoothing issue, that is, as the number of layers increases, the node representations become indistinguishable from each other. Although several studies (Li et al., 2018; Xu et al., 2019; Li et al., 2019; Zhao and Akoglu, 2020) have addressed this issue, no fundamental solution has yet been found, and particularly for molecular graphs, there are no practical deep architectures containing hundreds of layers, similar to ResNet (He et al., 2016).

WL Algorithm in Machine Learning

The WL algorithm (Weisfeiler and Lehman, 1968) is a well-established heuristic for identifying (non-)isomorphism between graphs. The essence of the algorithm is to expand the label of each node by concatenating the labels of its neighboring nodes and to compare these labels between graphs (see the Supplementary material for the actual procedure). The WL graph kernels (Shervashidze and Borgwardt, 2009; Shervashidze et al., 2011) adopted the WL label expansions directly as kernel features. Togninalli et al. (2018) applied the same idea to compute node representations for graph kernels. Our study follows the same line, but we focus on GNNs instead of graph kernels. We also extend the idea so that we can concurrently use the original labels to mitigate the sparsity problem; see the next section.

In the context of GNNs, several researchers have incorporated the idea of the WL algorithm into new layer architectures. For example, Jin et al. (2017) and Lei et al. (2017) proposed MPNNs whose layer update rules are inspired by mimicking the WL kernel. Xu et al. (2019) and Morris et al. (2019) revealed that the WL algorithm upper-bounds the capability of MPNNs in terms of isomorphism testing. Based on this analysis, Xu et al. (2019) proposed the Graph Isomorphism Network (GIN), in which the function in each layer is modeled by an MLP. Morris et al. (2019) proposed another GNN architecture that can emulate the higher-dimensional WL algorithm for further capability. In contrast to these studies addressing middle-layer architectures, we incorporate the label expansion operation into the node embedding layer, the cornerstone of GNNs. WL embedding can thus be combined with any of the above architectures.

Molecular Fingerprint

Encoding graph-theoretical or chemical information of molecules into a symbolic representation, known as a molecular fingerprint, has been a central problem in computational chemistry, at least dating back to the 1940s (Wiener, 1947). Fingerprints have a wide range of applications, such as a molecular similarity search or an analysis of quantitative structure-property relationships.

Owing to the long history of fingerprints in computational chemistry, many methods have been proposed (c.f., Todeschini and Consonni, 2009). Among them, the extended connectivity fingerprint (ECFP) (Rogers and Hahn, 2010) is one of the most popular approaches. It extracts topological information of molecular graphs by enumerating and hashing the subgraphs of a graph within a specified radius. To achieve this, it aggregates neighboring node information and expands the labels, similar to the WL algorithm. In this sense, WL embedding shares a similar design concept with ECFP. However, to the best of our knowledge, few attempts have been made to combine extended node labels with a GNN. In addition, some variants of our method create node embeddings for the center and neighboring nodes separately, whereas ECFP does not. We later show that this modification improves the empirical predictive performance for many different tasks.

3 WL Embedding

Node embedding is the first step of a GNN, which encodes a node $v$ into a continuous node feature vector $x_v$. Node embedding is implemented with embedding vectors $\{e_k\}_k$ and a hash function $\mathrm{hash}(\cdot)$ as

$x_v = e_{\mathrm{hash}(v)}.$   (1)

Usually, we assign the embedding vector by the node label $l_v$, i.e., $\mathrm{hash}(v) = l_v$, which we call atomic embedding.

To incorporate the neighbouring information, we use as a hash key an extended label, i.e., the tuple $(l_v, \{l_u : u \in N(v)\})$ of the node label and the multiset² of its neighbouring labels. Namely, inspired by the WL algorithm, we extend the hash function as $\mathrm{hash}(v) = \sigma(l_v, \{l_u : u \in N(v)\})$, where $\sigma$ is an injective map from the extended label to an integer. We now introduce WL embedding as

$x_v = e_{\sigma(l_v, \{l_u : u \in N(v)\})}.$   (2)

²A multiset is a generalization of a set that can contain overlapping elements.

Fig. 1 illustrates how WL embedding works. In contrast to atomic embedding, WL embedding makes every embedded vector of the carbon and oxygen atoms distinguishable because their neighborhoods are all different. This allows GNNs to treat atoms that appear frequently in molecules in a more fine-grained way.
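
To make the construction concrete, the following is a minimal sketch (our own illustration, not the authors' code) of how an atomic hash key and a WL hash key could be built from a small labeled graph; the toy molecule, the function names, and the dictionary-based index standing in for the injective map $\sigma$ are all assumptions.

```python
from collections import Counter

# A toy labeled graph: node -> atom label, plus an undirected adjacency list.
# (Hypothetical fragment for illustration only.)
labels = {0: "C", 1: "C", 2: "O", 3: "N"}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}

def atomic_key(v):
    """Atomic embedding key: the node label alone."""
    return labels[v]

def wl_key(v):
    """WL embedding key: the node label plus the multiset of its neighbors' labels."""
    neighbor_multiset = tuple(sorted(Counter(labels[u] for u in neighbors[v]).items()))
    return (labels[v], neighbor_multiset)

def build_index(keys):
    """Map each distinct key to an embedding index, mimicking the injective map."""
    index = {}
    for key in keys:
        index.setdefault(key, len(index))
    return index

atomic_index = build_index(atomic_key(v) for v in labels)
wl_index = build_index(wl_key(v) for v in labels)
print(atomic_index)  # one entry per atom type, e.g. {'C': 0, 'O': 1, 'N': 2}
print(wl_index)      # one entry per (label, neighbor-multiset) pattern
```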

3.1 Extensions

Although the naive form of WL embedding (2) increases the representability of atoms, it may cause a sparsity problem. Namely, it may produce many extended labels that are rarely used, and learning the embedding vectors involving less-frequent atoms such as sulfur becomes more challenging with limited data.

In this paper, we propose extensions of the naive WL idea that allow more robust and memory-efficient node embedding for molecular graph analysis. Our solution is to separate the atomic and neighbouring information into two embedding vectors, $a_v$ and $w_v$. Here, $a_v$ represents the atomic information, through which, for example, all carbon nodes can share their knowledge. In contrast, $w_v$ reflects the neighboring information, which maintains the high representability. To implement this idea, we consider two approaches.

C-WL Embedding

The first approach is to combine the two embedding vectors by concatenation. We define the concatenated WL (C-WL) embedding as follows:

$a_v = e^{\mathrm{atomic}}_{l_v},$   (3)
$w_v = e^{\mathrm{WL}}_{\sigma(l_v, \{l_u : u \in N(v)\})},$   (4)
$x_v = W [a_v ; w_v],$   (5)

where $[\cdot\,;\cdot]$ denotes concatenation and $W$ is a trainable weight matrix for linearly mixing $a_v$ and $w_v$.

G-WL Embedding

Another approach is to employ a weighted interpolation between $a_v$ and $w_v$, where the mixing weight is computed through a gate function. We formalize this idea as the Gated-sum WL (G-WL) embedding defined by

$a_v = e^{\mathrm{atomic}}_{l_v},$   (6)
$w_v = e^{\mathrm{WL}}_{\sigma(l_v, \{l_u : u \in N(v)\})},$   (7)
$g_v = \mathrm{sigm}(U_a a_v + U_w w_v + b),$   (8)
$x_v = g_v \odot a_v + (1 - g_v) \odot w_v,$   (9)

where $\odot$ denotes the Hadamard product, $\mathrm{sigm}(\cdot)$ is the element-wise sigmoid function, and $U_a$, $U_w$, and $b$ are the trainable parameters of the gate function $g_v$.

The advantages of G-WL over naive WL are essentially the same as those of C-WL. The difference is that G-WL adopts an adaptive gate function for mixing $a_v$ and $w_v$, whereas C-WL employs a fixed weight matrix to concatenate them. Adaptive gates allow the embedding module to tune the interpolated node embedding according to the combination of $a_v$ and $w_v$. On the other hand, the additional parameters of the gate function make training more complicated.
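
As an illustration of how C-WL and G-WL could be implemented, here is a hedged PyTorch sketch; the module names, vocabulary sizes, and layer shapes are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CWLEmbedding(nn.Module):
    """Sketch of C-WL embedding (Eqs. (3)-(5)): concatenate atomic and WL vectors,
    then mix them linearly with a trainable matrix W."""
    def __init__(self, num_atom_labels, num_wl_labels, dim):
        super().__init__()
        self.atom = nn.Embedding(num_atom_labels, dim)   # a_v
        self.wl = nn.Embedding(num_wl_labels, dim)       # w_v
        self.mix = nn.Linear(2 * dim, dim, bias=False)   # trainable W
    def forward(self, atom_ids, wl_ids):
        a, w = self.atom(atom_ids), self.wl(wl_ids)
        return self.mix(torch.cat([a, w], dim=-1))

class GWLEmbedding(nn.Module):
    """Sketch of G-WL embedding (Eqs. (6)-(9)): gated sum of atomic and WL vectors."""
    def __init__(self, num_atom_labels, num_wl_labels, dim):
        super().__init__()
        self.atom = nn.Embedding(num_atom_labels, dim)
        self.wl = nn.Embedding(num_wl_labels, dim)
        self.gate = nn.Linear(2 * dim, dim)              # produces the gate g_v
    def forward(self, atom_ids, wl_ids):
        a, w = self.atom(atom_ids), self.wl(wl_ids)
        g = torch.sigmoid(self.gate(torch.cat([a, w], dim=-1)))
        return g * a + (1.0 - g) * w

# Usage: x = CWLEmbedding(120, 5000, 64)(atom_ids, wl_ids), where atom_ids and
# wl_ids are LongTensors produced by hash functions such as those sketched above.
```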

3.2 Combining with GNNs

After obtaining the node embeddings $x_v$, we feed them into multiple GNN layers such as MPNN layers. For example, an $L$-layered MPNN iteratively updates the node latent vectors $h_v^{(l)}$ for each layer $l = 0, \ldots, L-1$ as

$h_v^{(l+1)} = \mathrm{UPD}\big(h_v^{(l)}, \mathrm{AGG}(\{ h_u^{(l)} : u \in N(v) \})\big).$   (10)

Here, UPD(ate) and AGG(regate) are appropriate functions (Gilmer et al., 2017). In addition, $h_v^{(0)}$ is initialized using the node embedding $x_v$. The final latent vector of node $v$ is denoted by $h_v^{(L)}$. To predict a discrete or continuous target variable $y$, we aggregate $\{h_v^{(L)}\}_{v \in V}$ with an appropriate readout function and make a prediction as $\hat{y} = f(\mathrm{READOUT}(\{h_v^{(L)}\}_{v \in V}))$.³ We train the GNN layers, the readout function, and the embedding vectors.

³Several authors have proposed readout functions that additionally use graph topology (e.g., Ying et al. (2018)). For simplicity, we do not consider such functions.
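
For concreteness, the sketch below shows one possible minimal message-passing network in the sense of Eq. (10), taking the node embeddings as $h_v^{(0)}$ and applying a sum readout; the sum aggregation, ReLU update, and dense adjacency are illustrative choices of ours, not the specific architectures used in the paper.

```python
import torch
import torch.nn as nn

class SimpleMPNN(nn.Module):
    """Minimal message-passing sketch of Eq. (10): h_v <- UPD(h_v, AGG({h_u})),
    with sum aggregation over neighbors, followed by a sum readout."""
    def __init__(self, dim, num_layers, out_dim):
        super().__init__()
        self.updates = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(num_layers)])
        self.readout = nn.Linear(dim, out_dim)
    def forward(self, x, adj):
        # x: (num_nodes, dim) node embeddings (e.g., the WL embedding output)
        # adj: (num_nodes, num_nodes) dense adjacency matrix (float)
        h = x
        for update in self.updates:
            agg = adj @ h                                       # AGG: sum over neighbors
            h = torch.relu(update(torch.cat([h, agg], dim=-1))) # UPD
        return self.readout(h.sum(dim=0))                       # sum readout, then prediction
```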

3.3 Iterative Label Expansion

As with the WL algorithm, we can further extend the labels by looking up more than one-hop neighbors (e.g., neighborhoods of neighborhoods). Iterating the expansion increases the representation power because expanded labels can exploit more distant nodes. On the other hand, because the possible expanded label patterns grow combinatorially, too many expansions make the embeddings extremely sparse. In the Supplementary material, we empirically investigate this trade-off using real datasets. Iterating the label expansion twice improves the performance of C-WL and G-WL over the single-iteration case when a dataset has a large sample size. In contrast, more iterations deteriorate the performance of the naive WL embedding for all datasets.
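
The iterated expansion can be sketched as follows (again an illustration under our own naming, reusing the toy graph from the earlier sketch); after t iterations, each label summarizes the t-hop neighborhood of the node.

```python
from collections import Counter

def expand_labels(labels, neighbors, iterations=2):
    """Sketch of iterated WL label expansion. Dictionary-based, illustrative only."""
    labels = dict(labels)
    table = {}  # injective map from expanded labels to fresh integer labels
    for _ in range(iterations):
        new_labels = {}
        for v in labels:
            key = (labels[v],
                   tuple(sorted(Counter(labels[u] for u in neighbors[v]).items())))
            new_labels[v] = table.setdefault(key, len(table))
        labels = new_labels
    return labels

# Example with the toy graph used above:
labels = {0: "C", 1: "C", 2: "O", 3: "N"}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(expand_labels(labels, neighbors, iterations=1))  # one-hop patterns
print(expand_labels(labels, neighbors, iterations=2))  # two-hop patterns
```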

4 Analysis

The process of WL embedding, which generates node representations by gathering neighboring information, looks very similar to a single iteration of the GNN update rule (10). So how do they differ, and how does this difference affect the outcome?

To answer this question, we first investigate the representability of WL embedding. Let $\mathcal{G}$ be a finite set of graphs with node labels and let $f(G) = c^\top \sum_{v \in V} z_v$ be a GNN with a linear classifier $c$ and a sum readout layer, where $z_v$ is the representation of node $v$ created either by a GNN or by WL embedding. Since the dimensionality of the input (representation) space roughly determines the capacity of a linear classifier (e.g., its VC dimension; see Mohri et al. (2018)), we check the best-possible dimensionality of the vector space spanned by the readout outputs $\{\sum_{v \in V} z_v \mid G \in \mathcal{G}\}$.

Definition 1.

Let $\Theta$ be the set of GNN parameters, including the embedding vectors. We define the maximum dimensionality of the readout output for a graph set $\mathcal{G}$ by $\mathrm{dim}(\mathcal{G}) = \max_{\theta \in \Theta} \dim \mathrm{span}\big\{\textstyle\sum_{v \in V} z_v \mid G = (V, E) \in \mathcal{G}\big\}$.

Suppose we employ the naive WL embedding vector (2) as the node representation $z_v$. Since $z_v$ is completely determined by the label $l_v$ and the neighbouring labels $\{l_u : u \in N(v)\}$, we can enumerate their patterns. Let $K$ denote the number of node labels. Given a graph $G$, let $n = (n_1, \ldots, n_K)$ be the multiplicities of the neighbouring labels, where $n_k$ counts the number of connected nodes labeled as $k$, and let $c_{k, n}$ be the number of nodes labeled as $k$ with neighbourhood pattern $n$. The readout is then written as

$\sum_{v \in V} z_v = \sum_{k} \sum_{n} c_{k, n}\, e_{\sigma(k, n)}$   (11)

(we interchangeably use the multiplicities $n$ and the corresponding multiset of neighbouring labels). We see that the readout lies in the linear space spanned by the embedding vectors, and its degree of freedom is maximized when all the embedding vectors are linearly independent. WL embedding immediately achieves this because, by definition, it can assign a unique vector to each pair $(k, n)$. When any node in $\mathcal{G}$ connects with at most $D$ nodes of the same label (i.e., $n_k \le D$ for every $k$), the number of combinations of the pairs $(k, n)$ is bounded above by $K(D+1)^K$, which determines the maximum dimensionality.

Proposition 2.

Let $\mathcal{G}$ be the set of graphs with $K$ node labels and label-specific maximum degree $D$. The maximum dimensionality of WL embedding (11) is bounded as $\mathrm{dim}(\mathcal{G}) \le K(D+1)^K$. The upper bound is achieved if the embedding dimension satisfies $\dim(e_k) \ge K(D+1)^K$.

Next, suppose the node representation is given by a single GNN layer defined as

$z_v = \rho\Big( e_{l_v} + \sum_{u \in N(v)} W_{l_u} e_{l_u} \Big),$   (12)

where $\rho$ is an activation function, $\{W_k\}_k$ are label-specific weight matrices, and $e_{l_v}$ is the atomic embedding vector of node $v$. GNNs defined by (12) can be seen as a restricted class of the multiset-broadcasting GNNs of Sato et al. (2019), including GraphSAGE (Hamilton et al., 2017), GCN (Kipf and Welling, 2017), and GIN (Xu et al., 2019). Using the same notation as in (11), the readout output is written as

$\sum_{v \in V} z_v = \sum_{k} \sum_{n} c_{k, n}\, \rho(b_{k, n}), \qquad b_{k, n} = e_k + \sum_{k'=1}^{K} n_{k'} W_{k'} e_{k'},$   (13)

where each $b_{k, n}$ is a point of the $K$-dimensional lattice (grid) spanned by the basis $\{W_{k'} e_{k'}\}_{k'=1}^{K}$. To maximize the dimensionality of the readout, the activation function $\rho$ needs to map every grid point of this lattice to a linearly independent vector. Following the strategy of Yun et al. (2018), we show the construction of such a mapping with the rectifier activation. The proof is deferred to the Supplementary material.

Theorem 3.

The maximum dimensionality of the GNN (13) is bounded as $\mathrm{dim}(\mathcal{G}) \le K(D+1)^K$. With the rectifier activation $\rho(x) = \max(x, 0)$, the upper bound is achieved if the embedding dimension and the layer width are at least $K(D+1)^K$ and the norm of the parameters grows on the order of $(D+1)^{K-1}$.

Figure 2: Left: Learning curves in a synthetic experiment (the subgraph detection task in Section 5). The atomic embedding is followed by a single layer of ReLU GCN with different layer widths. WL embeddings are directly fed into the readout layer (no GCN layer). Right: 2D grids with hyperplanes (lines) that contain the center points while keeping away from the other grid points.

Although Theorem 3 shows that WL embedding and the single ReLU layer have the same representation power, the condition on the parameter scale implies that the realization is not efficient. The scale of the parameters⁴ is significantly larger than the scale used in a normal setting.⁵ Attaining such a large-scale solution is challenging for stochastic gradient descent and its variants because their implicit bias favors small-norm solutions (Zhang et al., 2016; Soudry et al., 2017). Indeed, we verified this empirically: Figure 2 (left) shows that the atomic embedding with a single-layer GCN could not achieve zero training error, indicating the hardness of training (see the next section for the detailed setting).

⁴The scale increases with the number of labels and the label-specific degree for the following reason. To assign an independent vector to each grid point, we need a layer unit that is activated only by a specific point. We can create such a unit if there exists a hyperplane that contains only that point. Although we can always find such hyperplanes, their gradients inevitably increase as the number of points grows, so as to avoid hitting the other grid points, as shown in Figure 2 (right).

⁵For example, an initial scale on the order of the inverse square root of the layer width is a popular choice for ReLU layers (e.g., He et al. (2015)).

5 Experiment: Synthetic Graphs

To examine whether WL embedding behaves as we intended, we prepare two toy tasks: label counting and subgraph detection. The label counting task is to predict the number of specific node labels, which is solvable without graph structure. The subgraph detection task is to classify whether a graph has specific subgraphs, which requires neighboring information.

We prepare datasets consisting of two types of artificial graphs: positive graphs and negative graphs. For the positive graphs, we first generate a random regular graph (with a fixed number of nodes and degree) and remove edges independently at random with a fixed probability. We then attach one of three target subgraphs, all of which have the same number of nodes, by adding an edge between each node pair of the two graphs independently at random with a fixed probability. We adopt the generated graph if and only if it is connected, the degrees of all nodes do not exceed a fixed maximum, and it contains exactly one of the target subgraphs. For the negative graphs, we generate a random regular graph and remove edges independently at random in the same way; we adopt the generated graph if and only if it is connected and does not include any of the target subgraphs. For both procedures, we associate one of five labels with each node independently and uniformly at random.
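
As an illustration of the generation procedure, here is a hedged sketch of the negative-graph generator using networkx; all concrete sizes, degrees, and probabilities are placeholders rather than the paper's settings, and the subgraph check uses (induced) subgraph isomorphism as an approximation of "contains a target subgraph".

```python
import random
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

def make_negative_graph(n, d, p_remove, target_subgraphs, num_labels=5):
    """Sketch of the negative-graph generator described above: a random regular
    graph with randomly removed edges, accepted only if it is connected and
    contains none of the target subgraphs."""
    while True:
        g = nx.random_regular_graph(d, n)
        g.remove_edges_from([e for e in list(g.edges) if random.random() < p_remove])
        if not nx.is_connected(g):
            continue
        if any(GraphMatcher(g, s).subgraph_is_isomorphic() for s in target_subgraphs):
            continue
        for v in g.nodes:  # attach one of five node labels uniformly at random
            g.nodes[v]["label"] = random.randrange(num_labels)
        return g

# e.g., with a triangle as a (placeholder) target subgraph:
# g = make_negative_graph(n=20, d=3, p_remove=0.1, target_subgraphs=[nx.cycle_graph(3)])
```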

For each task, we generate three datasets, each containing a fixed number of positive graphs and negative graphs. For each pair of dataset and task, we conduct five runs with random initialization. All hyperparameters are manually fixed and shared across all trials. We adopt nonlinear GCNs with varying numbers of layers, combined with atomic and WL embeddings.

Figure 3 clearly shows that, while both embeddings behave mostly identically at the label counting task (Left), WL embeddings significantly outperform atomic embedding at the subgraph detection task (Right). This indicates that local structural information is actually important for node embedding, even with nonlinear activation. The results also capture the “curse of depth”. At both tasks, the performance gradually declines after two layers. Nevertheless, WL embeddings always improve the performance at the detection tasks, implying that WL embeddings somehow dispel the curse for neighborhood-sensitive tasks.

Figure 3: Experiments with artificial graph datasets. Averages and standard deviations over 15 trials are presented. Left: MAEs (smaller is better) of the counting task over the number of layers. Right: ROC-AUCs (larger is better) of the detection task over the number of layers.

6 Experiment: Molecular Benchmarks

In this section, we report our main empirical results on benchmark datasets. We compare the performances of the prediction among the aforementioned embedding techniques, combined with multiple GNN architectures. Details can be found in the supplementary material.

6.1 Datasets and Tasks

We use eight datasets in MoleculeNet (Wu et al., 2018): three for regression and five for classification. In all datasets, we used the train/validation/test data splits of a “scaffold” type, which is considered to be difficult for test predictions (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014). For the graph regression tasks, we use the QM9 and the QM8 datasets from the field of quantum chemistry, and the Lipophilicity (LIPO) dataset from the field of physical chemistry. The task is to predict the associated numerical value(s) from the molecular graphs. We evaluate the performance of the models using the mean absolute errors (MAEs) over properties and molecules. For the graph classification tasks, we use the Tox21, the ToxCast and the Clintox datasets from the field of physiology, and the HIV and the MUV datasets from the field of biophysics. The task is to predict the binary label(s) from the molecular graphs. We evaluate the performance of the models using the ROC-AUC values over targets and molecules.

6.2 Setting

We tested the embedding approaches on five popular GNN models: GCN, GGNN, RelGAT, GIN, and Neural FingerPrint (NFP) (Duvenaud et al., 2015), which is a GNN-based learnable molecular fingerprint. For each GNN, we implemented the readout function described in the original paper. We employed the softmax cross-entropy loss for classification tasks and the squared loss for regression tasks. For each dataset-model pair, we optimized the following hyperparameters with Optuna (Akiba et al., 2019) based on the validation score: the number of GNN layers, the dimension of the latent hidden vectors and embedding vectors, and the step size of the Adam optimizer (Kingma and Ba, 2015). The other architecture-specific hyperparameters, such as the MLP architecture of the GIN layers, were fixed throughout the study. All experiments were repeated with 10 different random seeds.

6.3 Main Results

Tables 1 and 2 show the generalization performance (i.e., test scores) over the 8 (datasets) × 5 (GNNs) = 40 cases. Overall, at least one of the WL embedding approaches consistently improves the generalization performance in all cases except the Tox21-NFP pair. In particular, C-WL embedding performs better than the atomic embedding in 37 out of 40 cases. G-WL embedding and the naive WL embedding also perform better than the atomic embedding in many cases but are less stable than C-WL. In addition, C-WL achieves the best test score among the four embeddings in 20 cases. These results indicate the importance of incorporating neighboring node labels into the node embedding for GNNs.

Dataset Embedding GCN RelGAT GGNN GIN NFP
LIPO Atomic 0.7077 0.6298 0.6670 0.7034 0.7067
Naive WL   +0.0744   +0.0462   -0.0716   +0.0641   +0.0582
C-WL   +0.0553   +0.0487   +0.0707   +0.0179   +0.0713
G-WL   +0.0688   +0.0453   +0.0255   +0.0192   +0.0582
QM9 Atomic 23.7247 11.8440 4.6252 6.0166 5.3825
Naive WL   +12.9973   +6.0987   +0.6665   +0.2998   +0.0226
C-WL   +13.2291   +6.1149   +0.4436   -0.2615   +0.1322
G-WL   +11.6967 N. A.   +0.4553   -0.8769   +0.1061
QM8 Atomic 0.0310 0.0234 0.0209 0.0322 0.0365
Naive WL   +0.0006   +0.0030   -0.0029   +0.0005   +0.0032
C-WL   +0.0004   +0.0020   +0.0009   0.0000   +0.0025
G-WL   +0.0018   +0.0029   +0.0004   +0.0006   +0.0027
Table 1: Ten-run averages of mean absolute errors (MAEs) of the regression benchmarks. For the atomic embedding, we present the test MAEs. For WL embeddings, we present the error reduction from the atomic embedding (i.e., positive values indicate the performance improvements). We highlight the results where the performance was improved in bold and the best results of each dataset in italics. For the QM9 dataset, we couldn’t obtain the result of RelGAT + G-WL due to the large data size and intensive memory usage. The full table with standard deviations can be found in Supplemental material.
Dataset Embedding GCN RelGAT GGNN GIN NFP
Tox21 Atomic 0.7082 0.7432 0.7136 0.7237 0.7618
Naive WL   +0.0606   -0.0499   +0.0142   -0.0251   -0.0425
C-WL   +0.0417   +0.0177   +0.0548   +0.0160   -0.0193
G-WL   +0.0625   -0.0013   +0.0442   +0.0347   -0.0312
HIV Atomic 0.7507 0.7037 0.7044 0.7108 0.6533
Naive WL   -0.0221   -0.0098   +0.0142   -0.0064   +0.0988
C-WL   +0.0038   +0.0206   +0.0139   +0.0134   +0.0593
G-WL   -0.0127   +0.0080   +0.0087   -0.0070   +0.0984
ToxCast Atomic 0.7745 0.7820 0.7720 0.7691 0.7585
Naive WL   +0.0182   -0.0358   -0.0222   +0.0043   -0.0005
C-WL   +0.0086   +0.0193   +0.0034   +0.0103   +0.0123
G-WL   +0.0043   -0.0017   +0.0155   +0.0053   -0.0284
clintox Atomic 0.9387 0.9287 0.9338 0.9305 0.9214
Naive WL   -0.0017   +0.0071   +0.0091   -0.0050   +0.0122
C-WL   +0.0158   +0.0072   +0.0100   +0.0026   +0.0195
G-WL   -0.0034   -0.0082   +0.0173   -0.0014   +0.0144
muv Atomic 0.6350 0.6164 0.6586 0.5716 0.6839
Naive WL   +0.0778   +0.0560   +0.0738   +0.0786   +0.0523
C-WL   +0.0503   +0.0513   +0.0931   +0.0055   +0.0288
G-WL   +0.0456   +0.0223   +0.0326   +0.0287   +0.0278
Table 2: Ten-run averages of ROC-AUCs for the classification benchmarks. The table format is the same as Table 1, except that AUC gains from the atomic embedding are presented for the WL embeddings.

7 Conclusion

In this study, we considered the design of molecular features for GNNs. Motivated by the history of chemical feature engineering, we proposed the use of WL embedding, a simple embedding method leveraging local atomic patterns. We also proposed its variants, the C-WL and G-WL embeddings, which alleviate the sparsity problem that might arise with the naive WL embedding. We showed that the representation power of WL embedding is equivalent to that of atomic embedding plus a ReLU GNN layer while reducing the parameter scale. The efficacy of the WL embeddings, especially C-WL embedding, was demonstrated in extensive experimental validations over multiple GNN architectures and multiple benchmark molecular graph datasets with different chemical properties.

References

  • Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2019.
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Bradshaw et al. (2019) John Bradshaw, Matt J. Kusner, Brooks Paige, Marwin H. S. Segler, and José Miguel Hernández-lobato. A Generative Model for Electron Paths. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • Breiman (2001) L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
  • Busbridge et al. (2019) Dan Busbridge, Dane Sherburn, Pietro Cavallo, and Nils Y Hammerla. Relational graph attention networks. arXiv preprint arXiv:1904.05811, 2019.
  • Coley et al. (2019) Connor W Coley, Wengong Jin, Luke Rogers, Timothy F Jamison, Tommi S Jaakkola, William H Green, Regina Barzilay, and Klavs F Jensen. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical science, 10(2):370–377, 2019.
  • De Cao and Kipf (2018) Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
  • Duvenaud et al. (2015) David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In Advances in Neural Information Processing Systems 28 (Proceedings of NIPS), pages 2224–2232, 2015.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural Message Passing for Quantum Chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1263–1272, 2017.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30 (Proceedings of NIPS), 2017.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2016.
  • Hu et al. (2020) Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In Proceedings of the 2020 International Conference on Learning Representations (ICLR), to appear, 2020. URL https://openreview.net/forum?id=HJlWWJSFDH.
  • Ishiguro et al. (2019) Katsuhiko Ishiguro, Shinichi Maeda, and Masanori Koyama. Graph Warp Module: an Auxiliary Module for Boosting the Power of Graph Neural Networks in Molecular Graph Analysis. arXiv, page 1902.01020 [cs.LG], 2019.
  • Jin et al. (2017) Wengong Jin, Connor W. Coley, Regina Barzilay, and Tommi Jaakkola. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. In Advances in Neural Information Processing Systems 30 (Proceedings of NIPS), 2017.
  • Jin et al. (2018) Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
  • Jin et al. (2019) Wengong Jin, Kevin Yang, Regina Barzilay, and Tommi Jaakkola. Learning multimodal graph-to-graph translation for molecular optimizatoin. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
  • Klicpera et al. (2020) Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional Message Passing For Molecular Graphs. In Proceedings of the 2020 International Conference on Learning Representations (ICLR), to appear, 2020. URL openreview.net/forum?id=B1eWbxStPH.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Lei et al. (2017) Tao Lei, Wengong Jin, Regina Barzily, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
  • Li et al. (2019) Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. DeepGCNs: Can GCNs go as deep as CNNs? In Proceedings of the IEEE International Conference on Computer Vision, pages 9267–9276, 2019.
  • Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-ming Wu. Deeper Insights into Graph Convolutional Networks for Semi-supervised Learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), 2018.
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated Graph Sequence Neural Networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.
  • Liu et al. (2019) Shengchao Liu, Mehmet Furkan Demirel, and Yingyu Liang. N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules. In Advances in Neural Information Processing Systems 32 (Proceedings of NeurIPS), 2019.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26 (Proceedings of NIPS), pages 1–9, 2013.
  • Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Morris et al. (2019) Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • NT and Maehara (2019) Hoang NT and Takanori Maehara. Revisiting Graph Neural Networks: All We Have is Low-Pass Filters. arXiv, page 1905.09550 [stat.ML], 2019.
  • Oono and Suzuki (2020) Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In Proceedings of the 2020 International Conference on Learning Representations (ICLR), to appear, 2020. URL openreview.net/forum?id=S1ldO2EFPr.
  • Pham et al. (2017) Trang Pham, Truyen Tran, Hoa Dam, and Svetha Venkatesh. Graph Classification via Deep Learning with Virtual Nodes. arXiv, page 1708.04357v1, 2017.
  • Ramakrishnan et al. (2014) Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1:140022, 2014.
  • Rogers and Hahn (2010) David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
  • Ruddigkeit et al. (2012) Lars Ruddigkeit, Ruud van Deursen, Lorenz C. Blum, and Reymond Jean-Louis. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of chemical information and modeling, 52(11):2864–2875, 2012.
  • Sato et al. (2019) Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Approximation ratios of graph neural networks for combinatorial problems. In Advances in Neural Information Processing Systems, pages 4083–4092, 2019.
  • Schlichtkrull et al. (2017) Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling Relational Data with Graph Convolutional Networks. arXiv, page 1703.06103v4 [stat.ML], 2017.
  • Shervashidze and Borgwardt (2009) Nino Shervashidze and Karsten M Borgwardt. Fast Subtree Kernels on Graphs. In Advances in Neural Information Processing Systems 22 (Proc. NIPS), pages 1660–1668, 2009.
  • Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erick Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
  • Smith (2020) Michael B Smith. March’s advanced organic chemistry: reactions, mechanisms, and structure. John Wiley & Sons, 2020.
  • Soudry et al. (2017) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
  • Takamoto and Izumi (2019) So Takamoto and Satoshi Izumi. TeaNet: Universal Neural Network Interatomic Potential Inspired by Iterative Electronic Relaxations. arXiv, pages 1912.01398v1 [physics.comp–ph], 2019.
  • Todeschini and Consonni (2009) Roberto Todeschini and Viviana Consonni. Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references, volume 41. John Wiley & Sons, 2009.
  • Togninalli et al. (2018) Matteo Togninalli, Elisabetta Ghisu, Felipe Llinares-López, Bastian Rieck, and Karsten Borgwardt. Wasserstein Weisfeiler-Lehman Graph Kernels. In Advances in Neural Information Processing Systems 31 (Proceedings of NeurIPS), 2018.
  • Wang et al. (2018) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Weisfeiler and Lehman (1968) B. Weisfeiler and A. A. Lehman. A Reduction of a Graph to a Canonical Form and an Algebra Arising during this Reduction. Nauchno-Technicheskaya Informatsia, Ser. 2(9):12–16, 1968.
  • Wiener (1947) Harry Wiener. Structural determination of paraffin boiling points. Journal of the American Chemical Society, 69(1):17–20, 1947.
  • Wu et al. (2018) Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet : A Benchmark for Molecular Machine Learning. Chemical Science, 9(2):513–530, 2018.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.
  • Yang et al. (2019) Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay. Are Learned Molecular Representations Ready For Prime Time? arXiv, page 1904.01561, 2019.
  • Ying et al. (2018) Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical Graph Representation Learning with Differentiable Pooling. In Advances in Neural Information Processing Systems 31 (Proceedings of NeurIPS), 2018.
  • Yun et al. (2018) Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small relu networks are powerful memorizers: a tight analysis of memorization capacity. arXiv preprint arXiv:1810.07770, 2018.
  • Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • Zhao and Akoglu (2020) Lingxiao Zhao and Leman Akoglu. PairNorm: Tackling Oversmoothing in GNNs. In Proceedings of the 2020 International Conference on Learning Representations (ICLR), to appear, 2020. URL https://openreview.net/forum?id=rkecl1rtwB.

Appendix A Additional Related Work

a.1 Improving GNNs for Multiple Datasets and Architectures

After the seminal study of GCN [Kipf and Welling, 2017], many GNN architectures have been proposed, each introducing inductive biases suited to a dataset of interest. Unfortunately, the community is still not aware of a gold-standard GNN architecture that works well across datasets and downstream tasks [Wu et al., 2018, Ishiguro et al., 2019]. This is in clear contrast to ResNet [He et al., 2016] in the computer vision community. Therefore, users of GNNs must choose a good GNN architecture for each dataset and downstream task.

A few researchers are interested in architecture-free techniques that improve performances of existing GNNs across several datasets and types of downstream tasks. Hu et al. [2020] proposed a strategy for pre-training GNNs with massive unsupervised and smaller supervised datasets. In the experiments, pre-trained GNNs obtained better performances over chemical and biological graph datasets. Ishiguro et al. [2019] developed a sub-network module that can be attached to generic MPNNs. The module bypasses neighborhood-based information propagation via a virtual super-node and gate mechanisms. The module consistently improves the performances of six different GNNs for molecular graph datasets.

We share the same goal with these works. We design the embedding so that it is not tailored to a specific GNN architecture. Our experiments demonstrate that the proposed embedding consistently improves the performance of different GNNs on multiple molecular graph datasets.

a.2 GNNs inspired by WL algorithm

Jin et al. [2017], Lei et al. [2017] propose MPNNs whose layer update rules are inspired by mimicking the WL kernel. The main idea is to concatenate the latent vectors of the neighboring nodes.

Xu et al. [2019] and Morris et al. [2019] reveal that the WL algorithm upper-bounds the capability of MPNNs in terms of isomorphism testing. Xu et al. [2019] prove that bijectivity of the layer update functions is required to achieve this upper bound, and propose the GIN architecture, in which the function in each layer is modeled by a multi-layer perceptron. Morris et al. [2019] propose to emulate the higher-dimensional WL algorithm for further capability. As explained in the main manuscript, all of these previous works propose new GNN architectures (layer formulations) but do not attend to the raw node feature representations before the GNN layers are applied.

a.3 Rich Atom Information

If the molecular graphs of interest provide additional features (e.g., the electronic charge of an atom) other than discrete node labels, we can leverage these additional features. Some GNN studies combine such features with their network architectures (e.g., [Yang et al., 2019, Pham et al., 2017]). Recently, a few GNN studies have also inferred the (imperfect) 3D structure of molecular graphs based on distance information between atom nodes [Takamoto and Izumi, 2019, Klicpera et al., 2020]. These approaches are orthogonal to our proposal: we can augment such GNNs by simply concatenating the node label embedding produced by our proposed method as an additional node feature.

a.4 Other topics

Liu et al. [2019] proposed a unique approach for molecular graph analysis based on a bag-of-random-walks representation. The method computes the graph representation from bag-of-random-walk embeddings, which in turn are built on node representations. It first trains a neural network to compute the node embeddings in an unsupervised manner, with a training objective based on the CBoW loss [Mikolov et al., 2013] from NLP tasks; namely, a node's embedding vector is trained to be predictable from the embedding vectors of its neighboring nodes.

Appendix B 1-Dimensional Weisfeiler–Lehman Algorithm

The $k$-dimensional WL algorithm [Weisfeiler and Lehman, 1968] is a heuristic used to identify (non-)isomorphism between two graphs based on $k$-tuples of nodes in the graphs.

The most popular variant is the 1-dimensional WL algorithm. This algorithm expands the label of each node by concatenating the labels of its neighboring nodes and relabels (hashes) them into a new label. The algorithm consists of three steps [Shervashidze et al., 2011, Morris et al., 2019]:

  1. Collecting neighbor labels: For each node $v$, the labels of the nodes in $N(v)$ are collected to form a multiset of labels, $M(v) = \{l_u : u \in N(v)\}$.

  2. Expanding the label: $M(v)$ is attached to $l_v$ to obtain the expanded label $\hat{l}_v$. Namely, $\hat{l}_v = (l_v, M(v))$.

  3. Relabeling: $v$ is relabeled by hashing $\hat{l}_v$: $l_v \leftarrow \mathrm{HASH}(\hat{l}_v)$, where $\mathrm{HASH}(\cdot)$ bijectively maps $\hat{l}_v$ to a unique value in a label set $\Sigma$. If $\hat{l}_v$ appears for the first time within the computation, HASH augments $\Sigma$ with a new symbol that is not yet in $\Sigma$ and returns that new symbol.

After performing the above steps for all nodes in the two graphs, the WL algorithm tests the following termination condition: if there is a node label that appears in only one graph, then the two graphs are not isomorphic. Otherwise, the WL algorithm repeats the above steps. A code sketch of this procedure follows.
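
Below is a small, self-contained Python sketch of the procedure (our illustration, not the authors' implementation); the shared hash table plays the role of HASH, and the histogram comparison implements the termination test.

```python
from collections import Counter

def wl_refine(labels, neighbors, table):
    """One 1-WL refinement round: expand each node label with the multiset of its
    neighbors' labels and relabel via a shared injective hash table."""
    return {v: table.setdefault(
                (labels[v],
                 tuple(sorted(Counter(labels[u] for u in neighbors[v]).items()))),
                len(table))
            for v in labels}

def wl_test(labels1, nbrs1, labels2, nbrs2, rounds=3):
    """Sketch of the 1-WL (non-)isomorphism test: if the label histograms ever
    differ, the graphs are certainly non-isomorphic; otherwise the test is
    inconclusive ('possibly isomorphic')."""
    table = {}  # shared so both graphs are hashed consistently
    for _ in range(rounds):
        labels1 = wl_refine(labels1, nbrs1, table)
        labels2 = wl_refine(labels2, nbrs2, table)
        if Counter(labels1.values()) != Counter(labels2.values()):
            return "non-isomorphic"
    return "possibly isomorphic"
```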

Appendix C Proof of Theorem 3

Our goal is to construct a one-to-one mapping from lattice points to linearly independent vectors. We first fix the center label $k$ in (13) to an arbitrary value so that we can focus on a single lattice, and then we generalize the result to all $k$. Let $Z$ be a matrix whose rows contain $\rho(b_{k,n})$ for each lattice point $n$. Our strategy is to make $Z$ triangular with non-zero diagonal elements so that all the row vectors are automatically linearly independent, at the cost of a sufficiently large parameter scale.

Suppose we employ the $K$-dimensional standard basis, scaled by a constant, as the embedding vectors. The lattice basis in (13) is then directly determined by the embedding vectors and the weight matrices $W_{k'}$, and without loss of generality we can parameterize the lattice basis vectors directly instead of the weight matrices. As the weight vector of a ReLU unit, we consider $w = (1, D+1, (D+1)^2, \ldots, (D+1)^{K-1})$, so that the inner product $\langle w, n \rangle$ reads the multiplicity vector $n$ as a number in base $D+1$ (for example, when $D = 1$, $\langle w, n \rangle$ is the number whose binary digits are given by $n$). Using this idea, we can set the weight and the bias of the unit assigned to a lattice point $n'$ as

$w_{n'} = w, \qquad b_{n'} = 1 - \langle w, n' \rangle.$   (14)

The number of nonzero elements in the row of $Z$ corresponding to $n$ then equals the number of lattice points whose base-$(D+1)$ code does not exceed that of $n$, and $Z$ becomes triangular, i.e.,

$Z_{n, n'} = \rho\big(\langle w, n \rangle - \langle w, n' \rangle + 1\big),$   (15)

which is positive on the diagonal and zero on one side of it when the lattice points are ordered by their base-$(D+1)$ codes

$\langle w, n \rangle = \sum_{k'=1}^{K} n_{k'} (D+1)^{k'-1}.$   (16)

Hence such weights and biases meet our goal.

Next, we measure their scale. The squared $\ell_2$-norm of $w$ is the sum of the squares of the geometric sequence $1, D+1, \ldots, (D+1)^{K-1}$. Summing this sequence, and since the embedding scale is constant, we conclude that $\|w\|_2 = \Theta((D+1)^{K-1})$. The same holds for the biases.

To generalize this for all $k$, we extend the weights and biases block-diagonally, where each block contains the original construction; this keeps the row vectors linearly independent across all lattices.

Appendix D Dataset Specification

d.1 MoleculeNet Datasets

MoleculeNet [Wu et al., 2018] is a standard collection of molecular graph datasets. It contains several molecular datasets from four different chemical fields: quantum chemistry, physical chemistry, physiology, and biophysics.

d.1.1 Dataset Details

For the graph regression tasks, we used the QM9 dataset, the QM8 dataset, and the Lipophilicity (LIPO) dataset. QM9 is a dataset containing approximately 133K molecules with 9 types of heavy atoms. The dataset consists of 12 important numerical values (target variables) of the chemical-energetic, electronic, and thermodynamic properties, such as HOMO, LUMO, and electron gaps computed through quantum chemistry techniques. QM8 is another quantum chemical dataset containing approximately 22K molecules with eight types of heavy atoms. QM8 also consists of 12 important chemical-energetic, electronic, and thermodynamic properties; however, different quantum chemistry methods are used for the computations. The LIPO dataset contains the solubility values of roughly 4K drug molecules. This dataset is taken from the physical chemistry field. Each sample in these datasets is a pair of a molecular graph and a numerical value(s): the 12 chemical properties in the QM9 and QM8 dataset, and the solubility in the LIPO dataset. For these datasets, the task is to predict the numerical value(s) from the molecular graph. We evaluated the performance of the models using mean absolute errors (MAEs). We report the averaged MAE over 12 sub-tasks (properties) for QM9 and QM8.

For the graph classification tasks, we used the Tox21, the HIV, the ToxCast, the Clintox, and the MUV datasets. The Tox21 dataset contains approximately 8K pairs of a molecular graph and a 12-dimensional binary vector that represents the experimental outcomes of toxicity measurements on 12 different targets. The HIV dataset contains roughly 42K pairs of a molecular graph and a binary label that represents the medicinal effect of the molecule. The ToxCast dataset contains about 8K pairs of a molecular graph and a 617-dimensional binary vector that represents different experimental results. The Clintox dataset contains 1491 pairs of a molecular graph and two binary labels that represent the clinical toxicity and the FDA approval status. The MUV (Maximum Unbiased Validation) dataset contains approximately 93K molecular graph samples, each attached with 17 labels, specifically designed for virtual screening validations. For these datasets, the task is to predict the binary label(s) from the molecular graph. The Tox21, ToxCast, and Clintox datasets are taken from the chemical physiology field. The HIV and MUV datasets belong to the biophysics collection of MoleculeNet. For these tasks, we use ROC-AUC values as the measure of performance. We report the averaged ROC-AUC over 12 sub-tasks (targets) for Tox21, 617 sub-tasks for ToxCast, 2 sub-tasks for Clintox, and 17 sub-tasks for MUV, respectively.

d.1.2 Graph Data Representation

All real-world datasets used in our experiments are provided in the SMILES format. SMILES is a line notation for describing the structure of chemical compounds. We decode each SMILES string into a multigraph representation of the molecule. A node in the graph corresponds to an atom, and each atom node is associated with the symbolic label of the atom name ("H", "C", ...).

An edge in the graph corresponds to a bond between atoms, and each bond edge is associated with the bond type information (single, double, ...). The edges thus carry the connectivity information between nodes (topology) as well as the bond type information.
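
A hedged sketch of this decoding step with RDKit is shown below; the exact preprocessing in the paper (e.g., handling of hydrogens or aromaticity) may differ, and the function name is ours.

```python
from rdkit import Chem

def smiles_to_multigraph(smiles):
    """Sketch of decoding a SMILES string into (atom labels, edges with bond types)
    using RDKit. Illustrative only."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    labels = {atom.GetIdx(): atom.GetSymbol() for atom in mol.GetAtoms()}
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
             for b in mol.GetBonds()]
    return labels, edges

labels, edges = smiles_to_multigraph("CCO")  # ethanol
print(labels)  # e.g. {0: 'C', 1: 'C', 2: 'O'}
print(edges)   # e.g. [(0, 1, 'SINGLE'), (1, 2, 'SINGLE')]
```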

d.1.3 Data splits

MoleculeNet provides several ways of splitting the data. The "random" split is the random sample shuffling most familiar to the machine learning community. The "scaffold" split separates samples based on the molecular two-dimensional structure. Since the scaffold split places structurally different molecules into different subsets, "it offers a greater challenge for learning algorithms than the random split" [Wu et al., 2018]. Throughout the paper, we adopt the scaffold split to assess the full potential of the GNNs equipped with the proposed embeddings.

The actual construction of the scaffold-split train/validation/test subsets leaves some freedom in the choice of algorithm. We adopted the algorithm provided by the deepchem library (https://deepchem.io/), which is the standard split algorithm used in many papers.

Appendix E Model Specification

e.1 Readout Layer

In many applications of GNNs, users may expect a single fixed-length vector representing the characteristics of the graph $G$. We therefore add a "readout" layer to aggregate the node latent vectors.

The main issue in the readout unit is how to aggregate the nodes, whose number varies from graph to graph. A simple way is to take an arithmetic average (or sum) of the node representations at the final layer, but we can also use a DNN to compute a (non-linear) "average" of the node representations [Li et al., 2016, Gilmer et al., 2017]. In the experiments, we implement the readout layer described in the original paper of each GNN architecture. A sketch of both options is given below.
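
The following illustrative PyTorch modules (under our own names) show the two options; the gated variant only loosely follows the spirit of the GGNN readout [Li et al., 2016] and is not the exact formulation used by any of the models.

```python
import torch
import torch.nn as nn

class SumReadout(nn.Module):
    """Arithmetic-sum readout: order- and size-invariant aggregation of node vectors."""
    def forward(self, h):          # h: (num_nodes, dim)
        return h.sum(dim=0)

class GatedReadout(nn.Module):
    """Sketch of a learned ('non-linear average') readout: a gated, weighted sum
    of transformed node vectors. Layer sizes and gating are illustrative choices."""
    def __init__(self, dim, out_dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)
        self.proj = nn.Linear(dim, out_dim)
    def forward(self, h):
        g = torch.sigmoid(self.gate(h))        # (num_nodes, 1) per-node weights
        return (g * self.proj(h)).sum(dim=0)   # (out_dim,)
```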

e.2 Hyperparameters

In the main experiments, we optimize three hyperparameters: (i) the number of layers of the GNN, (ii) the dimension of the latent hidden vectors of the nodes in the GNN layers, and (iii) the step size of the Adam [Kingma and Ba, 2015] optimizer.

We optimize these hyperparameters using the Optuna [Akiba et al., 2019] library. The target function to be optimized is the MAE (for regression) or the ROC-AUC (for classification) on the validation dataset. Tables 3 and 4 present the chosen hyperparameters. For RelGAT, the maximum is set to 32 to avoid memory shortage errors.

Other hyperparameters are fixed and shared among the experiments. We trained each GNN for 500 epochs. The minibatch size is fixed to 128 for GCN and GGNN, and to 32 for RelGAT because of GPU memory limits. All models were trained with the Adam optimizer [Kingma and Ba, 2015].

Dataset Embedding GCN RelGAT GGNN GIN NFP
LIPO Atomic (110, 3, 0.0017) (16, 4, 0.0022) (37, 3, 0.0091) (81, 3, 0.0015) (53, 5, 0.0062)
Naive WL (124, 2, 0.0018) (21, 3, 0.0038) (52, 2, 0.013) (37, 2, 0.0064) (176, 5, 0.00095)
C-WL (188, 2, 0.00016) (23, 4, 0.0026) (45, 5, 0.0032) (17, 2, 0.0016) (233, 2, 0.00017)
G-WL (164, 2, 0.0015) (21, 4, 0.0034) (45, 4, 0.0080) (39, 3, 0.0014) (232, 5, 0.00041)
QM9 Atomic (140, 2, 0.00031) (41, 4, 0.0022) (232, 5, 0.00029) (256, 4, 0.00016) (135, 6, 0.00049)
Naive WL (234, 2, 0.00047) (205, 3, 0.00069) (254, 6, 0.00018) (214, 2, 0.00032) (73, 5, 0.00081)
C-WL (253, 2, 0.00020) (56, 3, 0.0021) (248, 6, 0.00014) (255, 2, 0.00014) (162, 5, 0.00035)
G-WL (229, 3, 0.00011) N.A. (237, 6, 0.00035) (167, 2, 0.00055) (256, 6, 0.00034)
QM8 Atomic (148, 5, 0.00096) (32, 4, 0.0014) (122, 5, 0.00097) (253, 2, 0.00023) (253, 5, 0.00033)
Naive WL (93, 4, 0.0021) (88, 6, 0.00056) (174, 4, 0.0013) (137, 2, 0.00028) (139, 5, 0.00015)
C-WL (137, 5, 0.00011) (58, 5, 0.00097) (150, 3, 0.0010) (300, 2, 0.00023) (162, 6, 0.00024)
G-WL (254, 5, 0.00023) (16, 3, 0.0018) (277, 2, 0.0013) (86, 3, 0.00018) (255, 5, 0.000049)
Table 3: Hyperparameter choices for the DeepChem-scaffold splits, regression tasks. Each cell is formatted as (dimension, layers, step size), where dimension denotes the dimension of the latent vectors, layers denotes the number of layers, and step size denotes the step size of Adam.
Dataset Embedding GCN RelGAT GGNN GIN NFP
Tox21 Atomic (22, 4, 0.00024) (15, 3, 0.0027) (32, 3, 0.0018) (126, 4, 0.0012) (66, 2, 0.0011)
Naive WL (112, 2, 0.000088) (22, 4, 0.00042) (52, 2, 0.0029) (29, 2, 0.012) (119, 2, 0.0057)
C-WL (235, 3, 0.0023) (14, 2, 0.000054) (27, 3, 0.0055) (49, 5, 0.0027) (80, 3, 0.0039)
G-WL (51, 4, 0.0020) (7, 2, 0.0084) (75, 4, 0.0090) (55, 4, 0.00036) (78, 4, 0.0091)
HIV Atomic (105, 4, 0.00022) (6, 2, 0.0097) (23, 2, 0.0025) (74, 2, 0.00021) (56, 4, 0.0031)
Naive WL (50, 5, 0.0030) (16, 4, 0.0020) (27, 4, 0.000042) (28, 2, 0.00041) (230, 3, 0.000016)
C-WL (86, 6, 0.0011) (20, 3, 0.000065) (45, 2, 0.0017) (19, 2, 0.0031) (57, 3, 0.0020)
G-WL (43, 4, 0.0016) (5, 3, 0.000053) (110, 3, 0.0027) (42, 2, 0.0018) (39, 2, 0.00020)
ToxCast Atomic (177, 3, 0.000017) (20, 3, 0.0068) (30, 4, 0.0021) (80, 3, 0.0024) (121, 2, 0.0020)
Naive WL (73, 2, 0.00020) (17, 3, 0.00079) (71, 5, 0.0028) (25, 3, 0.019) (25, 2, 0.0022)
C-WL (103, 5, 0.00039) (20, 3, 0.0050) (128, 4, 0.0046) (64, 3, 0.0052) (79, 2, 0.00035)
G-WL (80, 2, 0.0037) (15, 3, 0.0110) (47, 2, 0.0153) (81, 4, 0.00016) (185, 2, 0.0020)
clintox Atomic (135, 5, 0.0034) (4, 3, 0.014) (92, 5, 0.0040) (34, 3, 0.0046) (17, 4, 0.0094)
Naive WL (99, 4, 0.00049) (11, 3, 0.078) (19, 6, 0.0031) (38, 5, 0.0029) (49, 3, 0.032)
C-WL (74, 3, 0.00063) (5, 3, 0.036) (24, 4, 0.0060) (124, 3, 0.0015) (131, 2, 0.0021)
G-WL (71, 4, 0.021) (6, 2, 0.020) (75, 5, 0.00076) (102, 6, 0.010) (135, 5, 0.021)
muv Atomic (36, 3, 0.00021) (8, 4, 0.0012) (50, 5, 0.00041) (214, 4, 0.000021) (28, 5, 0.0017)
Naive WL (55, 3, 0.00064) (11, 3, 0.0018) (148, 2, 0.00013) (104, 2, 0.000086) (178, 3, 0.000065)
C-WL (102, 3, 0.00011) (22, 4, 0.00053) (154, 5, 0.000016) (101, 3, 0.000018) (55, 5, 0.00034)
G-WL (143, 3, 0.000076) (22, 3, 0.0012) (24, 3, 0.0012) (20, 4, 0.000015) (169, 5, 0.00019)
Table 4: Hyperparameter choices for the DeepChem-scaffold splits, classification tasks. Each cell is formatted as (dimension, layers, step size), where dimension denotes the dimension of the latent vectors, layers denotes the number of layers, and step size denotes the step size of Adam.

Appendix F Full Results on Molecular Graph Experiments

Tables 5 and 6 present the full results corresponding to Tables 1 and 2 of the molecular graph data experiments.

Dataset Embedding GCN RelGAT GGNN GIN NFP
LIPO Atomic 0.7077 ± 0.0154 0.6298 ± 0.0208 0.6670 ± 0.0206 0.7034 ± 0.0251 0.7067 ± 0.0883
Naive WL 0.6333 ± 0.0112 0.5836 ± 0.0311 0.7386 ± 0.0960 0.6393 ± 0.0164 0.6485 ± 0.0194
C-WL 0.6524 ± 0.0165 0.5811 ± 0.0143 0.5963 ± 0.0251 0.6855 ± 0.0228 0.6354 ± 0.0172
G-WL 0.6389 ± 0.0198 0.5845 ± 0.0169 0.6415 ± 0.0229 0.6842 ± 0.0481 0.6485 ± 0.0105
QM9 Atomic 23.7247 ± 1.2825 11.8440 ± 0.7028 4.6252 ± 0.5631 6.0166 ± 0.2230 5.3825 ± 0.1375
Naive WL 10.7274 ± 0.6118 5.7453 ± 0.1262 3.9587 ± 0.0862 5.7168 ± 0.0919 5.3599 ± 0.2018
C-WL 10.4956 ± 0.6118 5.7291 ± 0.3087 4.1816 ± 0.1758 6.2781 ± 0.3279 5.2503 ± 0.1602
G-WL 12.0280 ± 1.7032 N. A. 4.1699 ± 0.1256 6.8935 ± 0.2965 5.2764 ± 0.1311
QM8 Atomic 0.0310 ± 0.0004 0.0234 ± 0.0005 0.0209 ± 0.0002 0.0322 ± 0.0011 0.0365 ± 0.013
Naive WL 0.0304 ± 0.0009 0.0204 ± 0.0004 0.0238 ± 0.0012 0.0317 ± 0.0009 0.0333 ± 0.0082
C-WL 0.0306 ± 0.0002 0.0214 ± 0.0014 0.0200 ± 0.0002 0.0322 ± 0.0004 0.0340 ± 0.0007
G-WL 0.0292 ± 0.0002 0.0205 ± 0.0004 0.0205 ± 0.0004 0.0316 ± 0.0007 0.0338 ± 0.0013
Table 5: Ten-run averages of mean absolute errors for the regression benchmarks. Smaller values are better. Note that the target values are not normalized across attributes. Bold indicates that WL embedding improves the generalization performance over atomic embedding. Italics indicate the best embedding within each cell. We could not obtain the result of RelGAT + G-WL for the QM9 dataset ("N.A.") due to the large data size and the memory-intensive formulation of the RelGAT model.
Dataset Embedding GCN RelGAT GGNN GIN NFP
Tox21 Atomic 0.7082 ± 0.0098 0.7432 ± 0.0278 0.7136 ± 0.0131 0.7237 ± 0.0075 0.7618 ± 0.0093
Naive WL 0.7688 ± 0.0060 0.6933 ± 0.0114 0.7278 ± 0.0104 0.6986 ± 0.0115 0.7193 ± 0.0110
C-WL 0.7499 ± 0.0056 0.7609 ± 0.0093 0.7684 ± 0.0125 0.7397 ± 0.0124 0.7425 ± 0.0076
G-WL 0.7707 ± 0.0050 0.7419 ± 0.0084 0.7578 ± 0.0067 0.7584 ± 0.0097 0.7306 ± 0.0105
HIV Atomic 0.7507 ± 0.0199 0.7037 ± 0.0151 0.7044 ± 0.0169 0.7108 ± 0.0075 0.6533 ± 0.0279
Naive WL 0.7286 ± 0.0164 0.6939 ± 0.0222 0.7186 ± 0.0210 0.7044 ± 0.0115 0.7521 ± 0.0082
C-WL 0.7545 ± 0.0073 0.7243 ± 0.0243 0.7183 ± 0.0196 0.7242 ± 0.0141 0.7126 ± 0.0076
G-WL 0.7380 ± 0.0164 0.7117 ± 0.0185 0.7131 ± 0.0132 0.7038 ± 0.0167 0.7517 ± 0.0081
ToxCast Atomic 0.7745 ± 0.0083 0.7820 ± 0.0054 0.7720 ± 0.0110 0.7691 ± 0.0066 0.7585 ± 0.0108
Naive WL 0.7927 ± 0.0062 0.7462 ± 0.0113 0.7498 ± 0.0093 0.7734 ± 0.0112 0.7580 ± 0.0074
C-WL 0.7831 ± 0.0041 0.8013 ± 0.0148 0.7754 ± 0.0007 0.7794 ± 0.0026 0.7708 ± 0.0071
G-WL 0.7788 ± 0.0106 0.7803 ± 0.0002 0.7875 ± 0.0111 0.7744 ± 0.0085 0.7301 ± 0.0083
clintox Atomic 0.9387 ± 0.0071 0.9287 ± 0.0153 0.9338 ± 0.0115 0.9305 ± 0.0092 0.9214 ± 0.0153
Naive WL 0.9370 ± 0.0044 0.9358 ± 0.0001 0.9429 ± 0.0119 0.9255 ± 0.0060 0.9336 ± 0.0074
C-WL 0.9545 ± 0.0054 0.9359 ± 0.0003 0.9438 ± 0.0078 0.9331 ± 0.0142 0.9409 ± 0.0072
G-WL 0.9353 ± 0.0015 0.9205 ± 0.0299 0.9511 ± 0.0062 0.9291 ± 0.0131 0.9358 ± 0.0094
muv Atomic 0.6350 ± 0.0159 0.6164 ± 0.0907 0.6586 ± 0.0295 0.5716 ± 0.0231 0.6839 ± 0.0405
Naive WL 0.7128 ± 0.0214 0.6724 ± 0.02265 0.7324 ± 0.0178 0.6502 ± 0.01792 0.7362 ± 0.0195
C-WL 0.6853 ± 0.0192 0.6677 ± 0.0281 0.7517 ± 0.0007 0.5771 ± 0.0217 0.7127 ± 0.0192
G-WL 0.6806 ± 0.0172 0.6387 ± 0.0326 0.6912 ± 0.0406 0.6003 ± 0.0392 0.7117 ± 0.0192
Table 6: Ten-run averages of ROC-AUCs for the classification benchmarks. Larger values are better. Bold indicates that the WL embedding improves the generalization performance over atomic embedding. Italics indicate the best embedding within each cell.

Appendix G Additional Experimental Results

G.1 Empirical Assessment of the Difficulty of Recovering WLE with Atomic Embedding + GNN Layer

To support the theoretical analysis in Sec. 4, we empirically study whether a single non-linear GNN layer (with a large hidden size) on top of the atomic embedding can extract information similar to the WLE.

In this experiment, the atomic embedding is followed by a single nonlinear GNN layer (specifically, a GCN layer [Kipf and Welling, 2017]) and the readout layer. In contrast, for the three WLEs (naive, C-WL, and G-WL) we apply no GNN layers: the readout is applied directly to the WL embedding vectors.

Fig. 4 presents how the ROC-AUC evolves with the dimension of the GCN hidden vectors (for the atomic embedding). As the figure shows, the GCN with atomic embedding never achieves AUCs comparable to those of the WLEs, regardless of the hidden dimension. The evolution of the training losses is presented in Fig. 2 of the main manuscript. Notably, the atomic embedding network cannot reduce the training loss to the levels reached by the WLEs, regardless of the model complexity (hidden dimension). These results indicate that it is very difficult to emulate WLE-like features with a typical MPNN on top of atomic embeddings.
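To make the comparison concrete, the following is a minimal PyTorch sketch of the two configurations compared here, assuming a dense normalized adjacency and a sum readout; the class names, layer choices, and readout are illustrative and not the authors' implementation.

```python
import torch
import torch.nn as nn


class AtomicEmbedPlusGCN(nn.Module):
    """Atomic embedding -> one GCN layer (ReLU) -> sum readout -> linear head."""

    def __init__(self, num_atom_labels, hidden_dim, out_dim):
        super().__init__()
        self.embed = nn.Embedding(num_atom_labels, hidden_dim)
        self.gcn = nn.Linear(hidden_dim, hidden_dim)  # weights of the single GCN layer
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, atom_labels, adj):
        # atom_labels: (N,) integer atom labels, adj: (N, N) normalized adjacency
        h = self.embed(atom_labels)          # atomic embedding
        h = torch.relu(self.gcn(adj @ h))    # one round of neighbor aggregation
        return self.head(h.sum(dim=0))       # graph-level readout


class WLEmbedReadout(nn.Module):
    """WL embedding -> sum readout -> linear head (no GNN layers in between)."""

    def __init__(self, num_wl_labels, hidden_dim, out_dim):
        super().__init__()
        self.embed = nn.Embedding(num_wl_labels, hidden_dim)
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, wl_labels):
        # wl_labels: (N,) integer WL-expanded labels (atom label + neighbor multiset)
        h = self.embed(wl_labels)
        return self.head(h.sum(dim=0))
```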

Figure 4: Subgraph detection experiments, atomic embedding + a single GCN layer vs. WL embedding. AUC evolution over the hidden dimension of the single GNN layer.

G.2 Analysis of Learned WL Embedding

In this section, we investigate how WL embedding affects representation learning in GNNs. As a representative example, we focus on the combination of the C-WL embedding and GCN.

G.2.1 Visualization of Learned Weight

We examine the contributions of the atomic embedding vectors and the neighboring embedding vectors in Eq. 3 through the learned weight matrix, whose entry scales we expect to indicate the importance of each embedding. The left half of the weight matrix acts on the atomic embedding and the right half acts on the neighboring embedding. To eliminate the scale invariance between the embedding vectors and the weight, we normalize the two embedding tables and rescale the weight matrix accordingly.
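A minimal NumPy sketch of this rescaling step is given below, under the assumption that the C-WL embedding concatenates the atomic and neighboring embedding vectors before a single linear map; the function name and data layout are hypothetical.

```python
import numpy as np


def rescaled_weight_halves(W, E_atom, E_nl):
    """Absorb the embedding scales into W so the two halves are comparable.

    W      : (d_out, 2 * d_emb) learned weight acting on [e_atom; e_nl]
    E_atom : (num_atom_labels, d_emb) learned atomic embedding table
    E_nl   : (num_nl_labels, d_emb)  learned neighboring-label embedding table
    """
    d_emb = E_atom.shape[1]
    W_atom, W_nl = W[:, :d_emb], W[:, d_emb:]

    # Typical scale of each embedding table (average row norm).
    s_atom = np.linalg.norm(E_atom, axis=1).mean()
    s_nl = np.linalg.norm(E_nl, axis=1).mean()

    # Rescale so that |W| reflects the actual contribution of each half.
    return np.abs(W_atom) * s_atom, np.abs(W_nl) * s_nl
```

Comparing the two returned halves (e.g., their mean absolute values) corresponds to the qualitative comparison shown in Figs. 5 and 6.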

Fig. 5 visualizes the absolute values of the weight matrix trained on the QM9 and Tox21 datasets, for which WL embedding yields large improvements. The visualization clearly shows that the weight entries acting on the neighboring embedding are more active than those acting on the atomic embedding.

Fig. 6 shows the weight matrices for the QM8 and HIV datasets, where we observe smaller performance improvements from the C-WL embedding. We note that even in these cases the C-WL embedding does not zero-suppress the weights for the neighboring embedding. The embedding instead seeks a balance between the original atom labels and the neighboring labels, which still contributes modestly to the downstream tasks.

Figure 5: Visualization of the learned weight matrix of the C-WL embedding for the QM9 dataset (upper panel) and the Tox21 dataset (lower panel). The absolute values of the matrix entries are shown.
Figure 6: Visualization of the learned weight matrix of the C-WL embedding for the QM8 dataset (upper panel) and the HIV dataset (lower panel). The absolute values of the matrix entries are shown. The C-WL embedding yields only small improvements on these datasets.

G.2.2 Importance of Neighboring Labels

Next, we quantitatively analyze the importance of neighboring labels. We employ node-label shuffling, inspired by permutation (feature) importance [Breiman, 2001]. First, we train GNNs with the C-WL embedding in the usual manner. After training, we randomly choose a node for each sample in the test dataset and change its node label or neighboring label to a different label. If specific labels are critically important, shuffling them will seriously degrade the prediction performance.

We manipulate the labels in two ways: the Atom shuffle and the Neighboring Label (NL) shuffle. The Atom shuffle replaces the node (atom) label of the chosen node with a pseudo label, while the NL shuffle replaces its neighboring label with a pseudo neighboring label. Formally, let $a_v$ and $n_v$ denote the atom label and the neighboring label of the chosen node $v$, and let $\tilde{a}_v$ and $\tilde{n}_v$ be pseudo labels. We overwrite the labels (and hence the embeddings) as follows:

Atom Shuffle: $(a_v, n_v) \rightarrow (\tilde{a}_v, n_v)$, (17)
NL Shuffle: $(a_v, n_v) \rightarrow (a_v, \tilde{n}_v)$. (18)

We choose the pseudo labels $\tilde{a}_v$ and $\tilde{n}_v$ so that they differ from $a_v$ and $n_v$, respectively.
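A minimal sketch of the two shuffling schemes, assuming each node carries an integer atom label and an integer neighboring label that index the two embedding tables (the data layout and function names are hypothetical):

```python
import random


def atom_shuffle(atom_labels, nl_labels, num_atom_labels, rng=random):
    """Pick one node and replace its atom label with a different label."""
    v = rng.randrange(len(atom_labels))
    new_label = rng.randrange(num_atom_labels - 1)
    if new_label >= atom_labels[v]:
        new_label += 1                      # guarantee the pseudo label differs
    atom_labels = list(atom_labels)
    atom_labels[v] = new_label
    return atom_labels, list(nl_labels)


def nl_shuffle(atom_labels, nl_labels, num_nl_labels, rng=random):
    """Pick one node and replace its neighboring label with a different label."""
    v = rng.randrange(len(nl_labels))
    new_label = rng.randrange(num_nl_labels - 1)
    if new_label >= nl_labels[v]:
        new_label += 1
    nl_labels = list(nl_labels)
    nl_labels[v] = new_label
    return list(atom_labels), nl_labels
```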

From the results (Table 7), we observe that the NL Shuffle has a larger impact on performance than the Atom Shuffle in all datasets. This implies that the neighboring labels serve as informative cues for molecular property prediction in WL embedding-based GNNs.

Dataset No Shuffle Atom Shuffle NL Shuffle
LIPO 0.6524 0.6718 0.6952
QM9 10.4956 17.2870 17.7292
QM8 0.0306 0.0329 0.0335
Tox21 0.7499 0.7413 0.6637
HIV 0.7545 0.7457 0.7059
ToxCast 0.7831 0.7819 0.7798
Table 7: Ten-run averages of MAE (upper three rows) and ROC-AUC (lower three rows) on shuffled test data. The "No Shuffle" results are taken from Tables 1 and 2. Bold numbers indicate the worst performance.

G.3 Iterative Label Expansion

We empirically evaluate how the iterative label expansion discussed in Section 3.3 affects model performance. Tables 8, 9, and 10 show the performance of models as the number of label-expansion iterations increases. We use a GCN as the base GNN. All experimental settings are the same as described in Section 6.2, except that we conduct a hyperparameter optimization for every run. We present results for the QM9, QM8, and Tox21 datasets.

The results for the QM9 dataset (Table 8) indicate a trade-off between representation power and model complexity. For the naive WL embedding, we observe that performance steadily degrades as the number of expansion iterations increases. Notably, the largest expansion is significantly worse than the intermediate one: the increased number of expanded labels makes the input feature vectors extremely sparse (for the QM9 dataset, 168,225 expanded labels at the largest expansion versus 4,958 at the intermediate one).

For the C-WL and G-WL embeddings, the intermediate expansion works better than the smallest one. As described in Section 3.1, the naive WL embedding maps different expanded labels to completely different points even if their original atom labels are the same. This does not hold for the C-WL and G-WL embeddings, thanks to their separate atomic and neighboring embeddings. We therefore hypothesize that C-WL and G-WL are more robust to the explosion of expanded label patterns than the naive WL embedding. In addition, the large sample size and the complexity of the QM9 dataset may prevent the C-WL and G-WL embeddings from overfitting at the intermediate expansion. Indeed, the smallest expansion achieves better results for C-WL and G-WL on the QM8 dataset (Table 9), which has a smaller sample size.
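For reference, the following is a minimal sketch of an iterative WL-style label-expansion loop; the compression of expanded labels into integer ids, tie-breaking, and the handling of edge labels may differ from the procedure used in the paper.

```python
def wl_expand(labels, neighbors, num_iters):
    """labels   : list of hashable node labels (e.g., atom types)
       neighbors: neighbors[v] = list of node indices adjacent to v
       Returns the labels after `num_iters` rounds of expansion."""
    for _ in range(num_iters):
        new_labels = []
        for v, lab in enumerate(labels):
            # Pair each node's label with the multiset of its neighbors' labels.
            nbr_multiset = tuple(sorted(labels[u] for u in neighbors[v]))
            new_labels.append((lab, nbr_multiset))
        # Compress the expanded labels into fresh integer ids.
        table = {lab: i for i, lab in enumerate(sorted(set(new_labels), key=repr))}
        labels = [table[lab] for lab in new_labels]
    return labels


# Example: a 3-node path O-C-O.
# labels = ['O', 'C', 'O']; neighbors = [[1], [0, 2], [1]]
# wl_expand(labels, neighbors, 1) -> [1, 0, 1]  (both oxygens share a label)
```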

Embedding #labels MAE
Atomic 10 17.0581 ± 1.3320
Naive WL 49 8.6073 ± 0.3279
4,958 8.8921 ± 0.1944
168,225 14.3417 ± 0.3956
C-WL 31 8.5097 ± 0.2675
4,826 8.2116 ± 0.2752
160,758 N.A.
G-WL 31 8.2913 ± 0.2411
4,826 *7.9231 ± 0.1276
160,758 N.A.
Table 8: The number of expanded labels and the MAE of the embeddings with iterative label expansion for the QM9 dataset. Within each embedding, rows correspond to an increasing number of expansion iterations; the first numeric column gives the number of (expanded) labels. We could not obtain results for C-WL and G-WL at the largest expansion due to GPU errors ("N.A."). Bold numbers indicate the best result within each embedding, and the asterisk marks the best result among all configurations.
Embedding #labels MAE
Atomic 10
Naive WL 49 *0.02866 ± 0.0004
2,968
39,002
C-WL 31 0.02927 ± 0.0002
2,865
38,570 N.A.
G-WL 31 0.02874 ± 0.0002
2,865
38,570 N.A.
Table 9: The number of expanded labels and the MAE of the embeddings with iterative label expansion for the QM8 dataset. Symbols are the same as in Table 8.

Table 10 shows the model performance, including correct G-WL, on the Tox21 dataset for various numbers of iterations.

Embedding #labels ROC-AUC
Atomic 84
Naive WL 457 0.7724 ± 0.0047
4,695
25,207
C-WL 237 0.7714 ± 0.0025
4,500 N.A.
24,916 N.A.
G-WL 237 *0.7744 ± 0.0026
4,500 N.A.
24,916 N.A.
Table 10: The number of expanded labels and the ROC-AUC of the embeddings with iterative label expansion for the Tox21 dataset. Symbols are the same as in Table 8.