Sparse hierarchical representation learning on molecular graphs

08/06/2019 ∙ by Matthias Bal, et al. ∙ GTN

Architectures for sparse hierarchical representation learning have recently been proposed for graph-structured data, but so far assume the absence of edge features in the graph. We close this gap and propose a method to pool graphs with edge features, inspired by the hierarchical nature of chemistry. In particular, we introduce two types of pooling layers compatible with an edge-feature graph-convolutional architecture and investigate their performance for molecules relevant to drug discovery on a set of two classification and two regression benchmark datasets of MoleculeNet. We find that our models significantly outperform previous benchmarks on three of the datasets and reach state-of-the-art results on the fourth benchmark, with pooling improving performance for three out of four tasks, keeping performance stable on the fourth task, and generally speeding up the training process.




1. Introduction and Related Work

Predicting chemical properties of molecules has become a prominent application of neural networks in recent years. A standard approach in chemistry is to conceptualize groups of individual atoms as functional groups with characteristic properties, and to infer the properties of a molecule from a multi-level understanding of the interactions between functional groups. This approach reflects the hierarchical nature of the underlying physics and can be formally understood in terms of renormalization (Lin et al., 2017). It thus seems natural to use machine learning models that learn graph representations of chemical space in a local and hierarchical manner. This can be realized by coarse-graining the molecular graph in a step-wise fashion, with nodes representing effective objects such as functional groups or rings, connected by effective interactions.

Much published work leverages node locality by using graph-convolutional networks with message passing to process local information; see Gilmer et al. (2017) for an overview. In graph classification and regression tasks, usually only a global pooling step is applied to aggregate node features into a feature vector for the entire graph (Duvenaud et al., 2015; Li et al., 2016; Dai et al., 2016; Gilmer et al., 2017).

Footnote 1: In some publications (Altae-Tran et al., 2017; Li et al., 2017a) the phrase “pooling layer” has been used to refer to a max aggregation step. We reserve the notion of pooling for an operation which creates a true hierarchy of graphs, in line with its usage for images in computer vision.

An alternative is to aggregate node representations into clusters, which are then represented by a coarser graph (Bruna et al., 2013; Niepert et al., 2016; Defferrard et al., 2016; Monti et al., 2017; Simonovsky and Komodakis, 2017; Fey et al., 2018; Mrowca et al., 2018). Early work uses predefined and fixed cluster assignments during training, obtained by a clustering algorithm applied to the input graph. More recently, dynamic cluster assignments are made on learned node features (Ying et al., 2018; Gao and Ji, 2019; Cangea et al., 2018; Gao et al., 2019). A pioneering step in using learnable parameters to cluster and reduce the graph was the DiffPool layer introduced by Ying et al. (2018). Unfortunately, DiffPool is tied to a disadvantageous quadratic memory complexity that limits the size of graphs and cannot be used for large sparse graphs. A sparse, and therefore more efficient, technique has been proposed by Gao and Ji (2019) and further tested and explored by Cangea et al. (2018); Gao et al. (2019).

Sparse pooling layers have so far not been developed for networks on graphs with both node and edge features. Edge features are particularly important for molecular datasets, where they may describe different bond types or distances between atoms. When coarsening the molecular graph, new effective edges need to be created whose features represent the effective interactions between the effective nodes. In this paper we explore two types of sparse hierarchical representation learning methods for molecules that process edge features differently during pooling: a simple pooling layer directly aggregates the features of the edges involved, while a more physically inspired coarse-graining pooling layer determines the effective edge features using neural networks.

We evaluate our approach on established molecular benchmark datasets (Wu et al., 2018), in particular on the regression datasets ESOL and Lipophilicity and the classification datasets BBBP and HIV, on which various models have been benchmarked (Li et al., 2017b; B. Goh et al., 2017a, b; B. Goh et al., 2017c; Goh et al., 2018; Shang et al., 2018; B. Goh et al., 2018; Jaeger et al., 2018; Urban et al., 2018; Feinberg et al., 2018; Zheng et al., 2019; Winter et al., 2019). We obtain significantly better results on ESOL, Lipophilicity, and BBBP, and state-of-the-art results on HIV. Simple pooling layers improve results on BBBP and HIV, while coarse-grain pooling improves results on Lipophilicity. In general, pooling layers keep performance at least stable while speeding up training.

2. Approach

2.1. Model architecture

We represent input graphs in a sparse representation using node feature vectors $x_i$ and edge feature vectors $e_{ij}$, where $j$ belongs to the set of nearest neighbours (NN) of node $i$. For chemical graphs we encode the atom type as a one-hot vector and the node degree as an additional entry in $x_i$, while the bond type is one-hot encoded in $e_{ij}$. Framed in the message-passing framework (Gilmer et al., 2017), the graph-convolutional models we use consist of alternating message-passing steps that process information locally and pooling steps that reduce the graph to a simpler sub-graph. Finally, a read-out phase gathers the node features and computes a feature vector for the whole graph, which is fed through a simple perceptron layer in the final prediction step.

Dual-message graph-convolutional layer  Since edge features are an important part of molecular graphs, the model architecture is chosen to give more prominence to edge features. We design a dual-message graph-convolutional layer that supports both node and edge features and treats them similarly. First, we compute an aggregate message to a target node $i$ from all neighbouring source nodes $j$ using a fully-connected neural network $f$ acting on the source node features $x_j$ and the features $e_{ij}$ of the connecting edge. A self-message $s_i$ computed from the original node features is added to the aggregated result. New node features are computed by applying batch norm (BN) and a ReLU non-linearity, i.e.

$$x_i' = \mathrm{ReLU}\Big(\mathrm{BN}\Big(\textstyle\sum_{j \in \mathrm{NN}(i)} f(x_j, e_{ij}) + s_i\Big)\Big).$$

In contrast to the pair-message graph-convolutional layer of Gilmer et al. (2017), we also update the edge features with the adjacent node feature vectors via

$$e_{ij}' = \mathrm{ReLU}\big(\mathrm{BN}\big(g(x_i, x_j, e_{ij}) + t_{ij}\big)\big),$$

where $g$ is a fully-connected neural network and $t_{ij}$ is the edge-feature self-message.
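A minimal sketch of one dual-message update, with random weight matrices standing in for the trained networks f and g and with batch norm omitted for brevity (layer widths and the placement of the inner non-linearity are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def dual_message_step(x, e, neighbours, Wf, Wg, Ws, Wt):
    """One dual-message convolution: update node and edge features.

    x: (N, d) node features; e: dict (i, j) -> edge feature vector;
    neighbours: dict i -> list of neighbour indices j; the W* matrices
    stand in for the networks f, g and the two self-message maps.
    """
    x_new = np.zeros_like(x)
    for i, nbrs in neighbours.items():
        # aggregate messages f(x_j, e_ij) from all neighbouring nodes
        m = sum(relu(np.concatenate([x[j], e[(i, j)]]) @ Wf) for j in nbrs)
        # add a self-message computed from the original node features
        x_new[i] = relu(m + x[i] @ Ws)
    e_new = {}
    for (i, j), eij in e.items():
        # edge update from both adjacent nodes plus an edge self-message
        e_new[(i, j)] = relu(np.concatenate([x[i], x[j], eij]) @ Wg + eij @ Wt)
    return x_new, e_new
```

The node update aggregates over neighbours while the edge update touches only the two adjacent nodes, mirroring the two equations above.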

Model                                   ESOL (RMSE)     Lipophilicity (RMSE)  BBBP (ROC-AUC)   HIV (ROC-AUC)
RF                                      1.07 ± 0.19     0.876 ± 0.040         0.714 ± 0.000    —
Multitask                               1.12 ± 0.15     0.859 ± 0.013         0.688 ± 0.005    0.698 ± 0.037
XGBoost                                 0.912 ± 0.000   0.799 ± 0.054         0.696 ± 0.000    0.756 ± 0.000
KRR                                     1.53 ± 0.06     0.899 ± 0.043         —                —
GC                                      0.648 ± 0.019   0.655 ± 0.036         0.690 ± 0.009    0.763 ± 0.016
DAG                                     0.82 ± 0.08     0.835 ± 0.039         —                —
Weave                                   0.553 ± 0.035   0.715 ± 0.035         0.671 ± 0.014    0.703 ± 0.039
MPNN                                    0.58 ± 0.03     0.719 ± 0.031         —                —
Logreg                                  —               —                     0.699 ± 0.002    0.702 ± 0.018
KernelSVM                               —               —                     0.729 ± 0.000    0.792 ± 0.000
IRV                                     —               —                     0.700 ± 0.000    0.737 ± 0.000
Bypass                                  —               —                     0.702 ± 0.006    0.693 ± 0.026
Chemception (B. Goh et al., 2017c; Goh et al., 2018)  —  —                    —                0.752
Smiles2vec (B. Goh et al., 2017a)       0.63            —                     —                0.8
ChemNet (B. Goh et al., 2017b)          —               —                     —                0.8
Dummy super node GC (Li et al., 2017b)  —               —                     —                0.766
EAGCN (Shang et al., 2018)              —               0.61 ± 0.02           —                0.83 ± 0.01
Mol2vec (Jaeger et al., 2018)           —               —                     —                0.79
Outer RNN (Urban et al., 2018)          0.62            0.64                  —                —
PotentialNet (Feinberg et al., 2018)    0.490 ± 0.014   —                     —                —
SA-BILSTM (Zheng et al., 2019)          —               —                     0.83 ± 0.02      —
RNN encoder (Winter et al., 2019)       0.58            0.62                  —                0.74
NoPool                                  0.410 ± 0.023   0.551 ± 0.010         0.846 ± 0.011    0.825 ± 0.008
SimplePooling (0.9)                     0.410 ± 0.018   0.536 ± 0.009         0.839 ± 0.022    0.824 ± 0.014
SimplePooling (0.8)                     0.417 ± 0.027   0.542 ± 0.013         0.869 ± 0.010    0.816 ± 0.020
SimplePooling (0.7)                     0.485 ± 0.020   0.563 ± 0.016         0.859 ± 0.009    0.825 ± 0.015
SimplePooling (0.6)                     0.413 ± 0.021   0.622 ± 0.030         0.852 ± 0.006    0.840 ± 0.019
SimplePooling (0.5)                     0.437 ± 0.016   0.637 ± 0.027         0.851 ± 0.012    0.822 ± 0.019
CoarseGrainPooling (0.9)                0.420 ± 0.015   0.517 ± 0.005         0.852 ± 0.010    0.834 ± 0.015
CoarseGrainPooling (0.8)                0.430 ± 0.019   0.529 ± 0.020         0.853 ± 0.009    0.833 ± 0.009
CoarseGrainPooling (0.7)                0.472 ± 0.013   0.530 ± 0.005         0.856 ± 0.012    0.830 ± 0.007
CoarseGrainPooling (0.6)                0.495 ± 0.053   0.536 ± 0.026         0.838 ± 0.020    0.824 ± 0.026
CoarseGrainPooling (0.5)                0.412 ± 0.031   0.535 ± 0.009         0.858 ± 0.023    0.826 ± 0.010
Table 1. (Top) Literature results for the MoleculeNet benchmarks, comparing RMSE (ESOL, Lipophilicity; lower is better) and ROC-AUC (BBBP, HIV; higher is better) across a range of models. Benchmarks without a reference come from Wu et al. (2018), except a subset of values taken from Feinberg et al. (2018). Dashes mark benchmarks a model was not evaluated on. (Bottom) Our model with simple pooling, coarse-grain pooling, and without pooling. The number in brackets specifies the pooling keep ratio of the pooling layer.

Pooling layer  Pooling layers, as introduced in Gao and Ji (2019), reduce the number of nodes by a pooling keep ratio $\rho \in (0, 1]$, specified as a hyperparameter, via scoring all nodes using a learnable projection vector $p$,

$$y_i = \frac{x_i \cdot p}{\lVert p \rVert},$$

and then selecting the nodes with the highest scores $y_i$. In order to make the projection vector trainable, and thus the node selection differentiable, the score is also used to determine a gating for each feature vector via

$$\tilde{x}_i = x_i \tanh(y_i),$$

where we only keep the top-$k$ nodes, with $k = \lceil \rho N \rceil$, and their gated feature vectors $\tilde{x}_i$.
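This scoring-and-gating step (following Gao and Ji, 2019) can be sketched as follows, with numpy arrays standing in for framework tensors:

```python
import numpy as np

def topk_pool(x, p, keep_ratio):
    """Score nodes by projection onto p, keep the top-k, gate with tanh.

    x: (N, d) node features; p: (d,) learnable projection vector.
    Returns the indices of the kept nodes and their gated features.
    """
    y = x @ p / np.linalg.norm(p)              # scalar score per node
    k = int(np.ceil(keep_ratio * x.shape[0]))  # number of nodes to keep
    idx = np.argsort(y)[-k:][::-1]             # indices of top-k scores
    # tanh gating makes the selection differentiable w.r.t. p
    return idx, x[idx] * np.tanh(y[idx])[:, None]
```

Because the gate tanh(y_i) multiplies every kept feature vector, gradients flow back into p even though the top-k selection itself is discrete.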

Pooling nodes requires the creation of new, effective edges between kept nodes while keeping the graph sparse. We discuss in Section 2.2 how to solve this problem in the presence of edge features.

Gather layer  After graph-convolutional and pooling layers, a graph gathering layer is required to map from node and edge features to a global feature vector. Assuming that the dual-message message-passing steps are powerful enough to distribute the information contained in the edge features to the adjacent node features, we gather over node features only by concatenating max and sum, and acting with a non-linearity on the result. All models have an additional linear layer that acts on each node individually before applying the gather layer and a final perceptron layer.
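As a concrete sketch, the gather step over node features can be written as below; using ReLU as the final non-linearity is an assumption, since the text only specifies "a non-linearity":

```python
import numpy as np

def gather(x):
    """Graph read-out: concatenate feature-wise max and sum over all nodes."""
    g = np.concatenate([x.max(axis=0), x.sum(axis=0)])
    return np.maximum(g, 0.0)  # non-linearity (ReLU assumed here)
```

The output dimension is twice the node feature dimension, independent of the number of nodes, which is what allows a fixed-size perceptron to follow.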

2.2. Pooling with edge features

An important step of the pooling process is to create new edges, based on the connectivity of the nodes before pooling, in order to keep the graph sufficiently connected. For graphs with edge features this process also has to create new edge features. In addition, the algorithm must be parallelizable for performance reasons.

We tackle these issues by specifying how to combine edge features into an effective edge feature between remaining (kept) nodes. Whenever two kept nodes are connected through a single dropped node or a pair of dropped nodes, we construct a new effective edge between them and drop the edges linked to the dropped nodes (see Fig. 1).

Figure 1. Schematic of a graph pooling step (yellow nodes are kept, blue nodes are dropped). Dangling nodes are removed, together with their connecting edges. Pairs of edges connecting a dropped node to two kept nodes are coarse-grained to a new edge (heavy lines). New edges can also be constructed between kept nodes connected by two dropped nodes (heaviest line).

We propose two layers to calculate the effective edge feature that replaces the dropped edge features. A simple pooling layer computes an effective edge feature by summing all edge feature vectors along the paths connecting pairs of kept nodes. When multiple paths between a pair of kept nodes are reduced simultaneously, this method generates overlapping effective edges, which we merge into a single edge by summing their feature vectors.
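A sketch of this simple-pooling edge construction: length-two paths through a single dropped node are contracted into one effective edge whose feature is the sum along the path, and overlapping effective edges are merged by summation (paths through two dropped nodes are handled analogously and omitted here for brevity):

```python
import numpy as np

def contract_edges(edges, kept):
    """Contract length-2 paths through dropped nodes into effective edges.

    edges: dict (i, j) -> feature vector, each undirected edge stored once;
    kept: set of node indices that survive the pooling step.
    """
    new_edges = {}
    def add(i, j, feat):
        key = (min(i, j), max(i, j))
        # overlapping effective edges are merged by summing their features
        new_edges[key] = new_edges.get(key, 0) + feat
    # edges between two kept nodes are carried over unchanged
    for (i, j), f in edges.items():
        if i in kept and j in kept:
            add(i, j, f)
    # a dropped node v linking two kept nodes yields a new effective edge
    for v in {n for ij in edges for n in ij} - kept:
        nbrs = [(u, f) for (i, j), f in edges.items()
                for u in ((j,) if i == v else (i,) if j == v else ())]
        for a, (u, fu) in enumerate(nbrs):
            for w, fw in nbrs[a + 1:]:
                if u in kept and w in kept and u != w:
                    add(u, w, fu + fw)  # sum features along the path u-v-w
    return new_edges
```

For example, dropping the middle node of a path 0-1-2 yields a single effective edge (0, 2) carrying the sum of the two original edge features.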

In chemistry, however, effective interactions are known to be more complex functions of the component features involved. Using this as an inspiration, we propose a more expressive coarse-graining pooling layer, obtained by replacing the simple aggregation function with neural networks that compute the effective edge features. In particular, we use two fully-connected neural networks: the first maps the atom and adjoining edge feature vectors of dropped nodes to a single effective edge feature, while the second calculates effective edge features for kept edges (between kept nodes) to account for an effective coarse-grained interaction compensating for the deleted nodes.

We use pooling layers after every convolutional layer except the final one. With $L$ convolutional layers and keep ratio $\rho$, the number of nodes is thus reduced by a factor of roughly $\rho^{L-1}$. This compression not only gets rid of irrelevant information but also reduces memory requirements and makes training faster, as we show in the experiments in Sec. 3.
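For example, with a keep ratio of 0.5 and three convolutional layers (hence two pooling steps), a 100-node graph shrinks to roughly a quarter of its size; rounding up at each step is an assumption consistent with top-k selection:

```python
import math

keep_ratio = 0.5
num_conv_layers = 3                 # pooling after every conv layer but the last
num_pool_steps = num_conv_layers - 1

nodes = 100
for _ in range(num_pool_steps):
    nodes = math.ceil(keep_ratio * nodes)  # top-k keeps ceil(rho * N) nodes
print(nodes)  # 25
```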

3. Experimental Results on MoleculeNet

Model parameters and implementation  We use hyperparameter tuning with the hyperband algorithm (Li et al., 2018) to decide on the number of stacks and the channel dimensions of graph-convolutional and pooling layers, while keeping the pooling keep ratio fixed. All our models were implemented in PyTorch and trained on a single Nvidia Tesla K80 GPU using the Adam optimizer with a fixed learning rate.

Evaluation on MoleculeNet  We evaluate our models with and without pooling layers on the MoleculeNet benchmark set (Wu et al., 2018). We focus on four datasets: the regression benchmarks ESOL (1128 molecules) and Lipophilicity (4200 molecules), where performance is evaluated by RMSE, and the classification benchmarks BBBP (2039 molecules) and HIV (41127 molecules), evaluated via ROC-AUC. Following Wu et al. (2018), we used a scaffold split for the classification datasets as provided by the DeepChem package. Apart from the benchmarks generated in the original paper, various models have been evaluated on these datasets (Li et al., 2017b; B. Goh et al., 2017a, b; B. Goh et al., 2017c; Goh et al., 2018; Shang et al., 2018; B. Goh et al., 2018; Jaeger et al., 2018; Urban et al., 2018; Feinberg et al., 2018; Zheng et al., 2019; Winter et al., 2019). An overview of the literature results can be found in the top part of Table 1. Our results are the mean and standard deviation of 5 runs over 5 random splits (ESOL, Lipophilicity) or 5 runs over the same scaffold split (BBBP, HIV). Datasets were split into training (80%), validation (10%), and held-out test (10%) sets. The validation set was used to tune model hyperparameters, and all reported metrics are results on the test set. The results of our models with and without pooling are displayed in the lower part of the table.
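The 80/10/10 split protocol can be sketched as follows; this is a plain random split for illustration, while the scaffold split used for BBBP and HIV comes from DeepChem and is not reproduced here:

```python
import random

def split_dataset(items, seed=0):
    """Random 80/10/10 train/validation/test split of a dataset."""
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle per seed
    n = len(items)
    a, b = int(0.8 * n), int(0.9 * n)
    return items[:a], items[a:b], items[b:]
```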

Pooling keep ratio   0.9    0.8    0.7    0.6    0.5
Speed-up             16%    24%    47%    55%    70%
Table 2. Speed-up of training runs on the HIV dataset using simple pooling. The speed-up is measured as the increase in speed in terms of elapsed real time compared to the run without pooling layers.

For the regression tasks, we found that our models significantly outperformed previous models for both datasets, with pooling layers keeping performance stable for ESOL and the coarse-grain pooling layer significantly improving results for Lipophilicity (see Table 1). Regarding classification tasks, we found that our models significantly outperformed previous models on BBBP and also exceeded previous benchmarks for the HIV dataset. For both datasets simple pooling layers improved performance. Curiously, the extent to which pooling layers improve performance and which layer is better suited for a particular task strongly depends on the dataset. It seems that simple pooling performs much better for classification tasks while for regression tasks it depends on the dataset.

We also measure the speed-up given by pooling layers during the evaluation on the HIV dataset in terms of elapsed real-time, using the simple pooling layer. The results are displayed in Table 2. We see significant speed-ups for moderate values of the pooling ratio.

4. Conclusion

We introduce two graph-pooling layers for sparse graphs with node and edge features and evaluate their performance on molecular graphs. Our model without pooling significantly outperforms previous benchmarks on ESOL, Lipophilicity, and BBBP and reaches state-of-the-art results on HIV in the MoleculeNet benchmark set; our pooling methods keep or improve this performance while providing a speedup of up to 70% in the training of graph-convolutional neural networks that utilize edge features, along with a reduction in memory requirements.

While all experiments have been performed on datasets of small, drug-like molecules, we expect even stronger performance on datasets of larger graphs such as protein structures, where pooling can create a large, sequential hierarchy of graphs. More generally, our work may lead to more pertinent and information-efficient latent-space representations for graph-based machine learning models.


Appendix A Supplementary Material

A.1. Materials science application: Clean Energy Project 2017 dataset

Pooling        Multi-task                                        Single-task
ratio          on PCE          on GAP          on HOMO           on PCE
none           0.862 ± 0.005   0.967 ± 0.001   0.981 ± 0.000     0.866 ± 0.003
0.9            0.863 ± 0.003   0.966 ± 0.001   0.981 ± 0.000     0.862 ± 0.002
0.8            0.860 ± 0.003   0.966 ± 0.001   0.981 ± 0.001     0.859 ± 0.004
0.7            0.856 ± 0.003   0.964 ± 0.001   0.980 ± 0.001     0.855 ± 0.004
0.6            0.854 ± 0.007   0.962 ± 0.002   0.979 ± 0.001     0.853 ± 0.003
0.5            0.844 ± 0.003   0.955 ± 0.002   0.974 ± 0.001     0.833 ± 0.007

none           0.217 ± 0.004   0.177 ± 0.002   0.134 ± 0.001     0.215 ± 0.002
0.9            0.217 ± 0.002   0.180 ± 0.002   0.134 ± 0.001     0.218 ± 0.002
0.8            0.220 ± 0.002   0.179 ± 0.002   0.135 ± 0.003     0.220 ± 0.003
0.7            0.179 ± 0.097   0.148 ± 0.080   0.112 ± 0.060     0.223 ± 0.003
0.6            0.224 ± 0.005   0.190 ± 0.005   0.143 ± 0.004     0.225 ± 0.003
0.5            0.232 ± 0.002   0.208 ± 0.004   0.159 ± 0.003     0.239 ± 0.005
Table 3. Multi-task and single-task benchmark results for power conversion efficiency (PCE), band gap (GAP), and highest occupied molecular orbital (HOMO) energy on the CEP-2017 benchmark for different ratios of kept nodes in each pooling step (averaged over 5 runs, with 5 random splits). Speedup of pooling runs is measured in terms of elapsed real time compared to the run without pooling.

In this section, we propose a regression benchmark for hierarchical models using the 2017 non-fullerene electron-acceptor update (Lopez et al., 2017) to the Clean Energy Project molecular library (Hachmann et al., 2011). We refer to this dataset as CEP-2017. The dataset was generated by combining molecular fragments from a reference library, yielding 51256 unique molecules. These molecular graphs were then used as input to density functional theory electronic-structure calculations of quantum-mechanical observables (such as GAP and HOMO). Restrictions of the crowd-sourced computing platform limited these structures to molecules with 306 electrons or fewer. The directly observable quantities are then fed into a physically motivated but empirical Scharber model (Scharber et al., 2006) to predict the power conversion efficiency (PCE), the ultimate figure of merit for a new photovoltaic material.

We emphasize that this data, generated with an approximate density functional theory method and then fed into an empirical PCE model, lacks predictive power for the design of new materials. However, a machine learning model built on this data is likely to be transferable to other molecular datasets built on higher-level theory (such as coupled-cluster calculations) or experimental ground truth. Anticipating this future application of the method, we use the raw (_calc) values rather than the values regressed with a Gaussian process against a small experimental dataset (_calib).

The method of construction of the dataset allows us to highlight the coarse-graining interpretation of the pooling layers introduced in the main text, in terms of the explicit combinatorial building blocks of the non-fullerene electron acceptors.

In Table 3, we show multi-task and single-task test-set evaluation results for the power conversion efficiency (PCE), the band gap (GAP), and the highest occupied molecular orbital (HOMO) energy. We used a dual-message graph-convolutional model with three graph-convolutional layers and two interleaved layers of simple pooling. We found our model to be a powerful predictor of both fundamental quantum-mechanical properties (GAP and HOMO) and, to a lesser extent, the more empirical PCE figure. The inclusion of pooling layers resulted in a significant speedup and only a very mild decay in performance.

A.2. Pooling layer illustrations

In Fig. 2(a-c) we visualize the effect of two consecutive pooling layers (each keeping only 50% of the nodes) on a batch of molecules for a DM-SimplePooling model trained on a random split of the CEP-2017 dataset introduced in Sec. A.1. After the first pooling layer (Fig. 2(b)), the model has approximately learned to group rings and identify the backbones or main connected chains of the molecules. After the second pooling layer (Fig. 2(c)), the molecular graphs have been reduced to basic, abstract components connected by chains, encoding a coarse-grained representation of the original molecules. Disconnected parts can be interpreted as a consequence of the aggressive pooling forcing the model to pay attention to the parts it considers most relevant for the task at hand.

Figure 2. Pooling of molecular graphs (heavy atoms only) sampled from the CEP-2017 dataset. (a) Molecular graphs before pooling. (b) Coarse-grained graphs after pooling layer 1. (c) Coarse-grained graphs after pooling layer 2.
Dataset         Model                Keep ratio   Node channels      Edge channels
ESOL            NoPooling            —            [128, 128]         [128, 128]
                CoarseGrainPooling   0.9          [128, 128]         [256, 256]
                                     0.8          [128, 128]         [64, 64]
                                     0.7          [512, 512, 512]    [128, 128, 128]
                                     0.6          [256, 256, 256]    [128, 128, 128]
                                     0.5          [128, 128]         [64, 64]
                SimplePooling        0.9          [256, 256]         [128, 128]
                                     0.8          [256, 256, 256]    [256, 256, 256]
                                     0.7          [128, 128, 128]    [128, 128, 128]
                                     0.6          [512, 512]         [128, 128]
                                     0.5          [256, 256]         [256, 256]
Lipophilicity   NoPooling            —            [256, 256, 256]    [64, 64, 64]
                CoarseGrainPooling   0.9          [256, 256]         [128, 128]
                                     0.8          [256, 256]         [64, 64]
                                     0.7          [256, 256]         [64, 64]
                                     0.6          [256, 256]         [256, 256]
                                     0.5          [512, 512]         [64, 64]
                SimplePooling        0.9          [512, 512]         [64, 64]
                                     0.8          [128, 128]         [128, 128]
                                     0.7          [256, 256]         [128, 128]
                                     0.6          [512, 512]         [128, 128]
                                     0.5          [256, 256]         [128, 128]
BBBP            NoPooling            —            [128, 128]         [256, 256]
                CoarseGrainPooling   0.9          [256, 256]         [256, 256]
                                     0.8          [512, 512]         [256, 256]
                                     0.7          [128, 128, 128]    [128, 128, 128]
                                     0.6          [256, 256]         [64, 64]
                                     0.5          [256, 256, 256]    [64, 64, 64]
                SimplePooling        0.9          [128, 128, 128]    [256, 256, 256]
                                     0.8          [512, 512]         [128, 128]
                                     0.7          [256, 256]         [64, 64]
                                     0.6          [512, 512]         [256, 256]
                                     0.5          [128, 128, 128]    [64, 64, 64]
HIV             All models           all          [512, 512, 512]    [128, 128, 128]
Table 4. Graph-convolutional model hyperparameters used in this work.