1 Introduction
Following the recent rise of deep learning for image and speech processing, there has been great interest in generalizing convolutional neural networks to arbitrary graph-structured data [Gilmer et al., 2017, Henaff et al., 2015, Xu et al., 2018]. To this end, graph neural networks (GNNs), which fall into either spectral-based or spatial-based approaches, have been proposed. Spectral methods define the graph convolution (GC) as a filtering operator on the graph signal [Defferrard et al., 2016], while spatial methods define the GC as message passing and aggregation across nodes [Henaff et al., 2015, Xu et al., 2018, Jin et al., 2018]. In drug discovery, GNNs have been very successful across several molecular graph classification and generation tasks. In particular, they outperform predetermined molecular fingerprints and string-based approaches for molecular property prediction and de novo generation of drug-like compounds [Jin et al., 2018, Li et al., 2018b].

However, the node feature update performed by most GNNs introduces some important limitations. Experimental results indicate a performance decrease for deeper GNNs due to the signal-smoothing effect of each GC layer [Li et al., 2018a]. This limits the network's depth and restricts the receptive field of the vertices in the graph to a few-hop neighbourhood, which is insufficient to properly capture local structures, relationships between nodes, and subgraph importance in sparse graphs such as molecules. For example, at least three consecutive GC layers are needed for atoms on opposite sides of a benzene ring to exchange information. This issue is exacerbated by the single global pooling step performed at the end of most GNNs, which ignores any hierarchical structure within the graph.
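The benzene-ring claim above can be checked with a short breadth-first search; this is purely illustrative code, not part of the paper's implementation — on a 6-cycle, opposite atoms are 3 hops apart, so at least three GC layers are needed for them to exchange information.

```python
from collections import deque

def hop_distance(adj, src, dst):
    """Breadth-first search for the shortest hop count between two nodes."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

# Benzene ring: six carbons in a cycle.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

# Atoms 0 and 3 sit on opposite sides of the ring.
print(hop_distance(ring, 0, 3))  # -> 3
```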
To cope with these limitations, graph coarsening (pooling) methods have been proposed to reduce graph size and enable long-distance interaction between nodes. The first proposed methods relied solely on deterministic clustering of the graphs, making them non-differentiable and task-independent [Jin et al., 2018, Dafna and Guestrin, 2009, von Luxburg, 2007, Ma et al., 2019]. In contrast, more recent methods use node features but are uninterpretable and, as we will show, are unable to preserve the structure of sparse graphs after pooling [Ying et al., 2018, Gao and Ji, 2018].
Building on theory in graph signal processing, we propose LaPool (Laplacian Pooling), a differentiable pooling method that takes into account both the graph structure and its node features. LaPool performs a dynamic and hierarchical segmentation of graphs by selecting a set of centroid nodes as cluster representatives (leaders), then learning a sparse assignment of the remaining nodes (followers) to these clusters using an attention mechanism. LaPool is compared to other state-of-the-art methods in Table 1, with the primary contributions of this paper summarized below:

Using established tools from graph signal processing (GSP), we propose a novel and differentiable pooling module (LaPool) that can be incorporated into existing GNNs to yield more expressive networks.

We show that LaPool outperforms recently proposed graph pooling layers on discriminative and generative learning benchmarks for sparse molecular graphs.

We perform a qualitative assessment of the pooling performed by LaPool to highlight its improved interpretability.
As shown in Figure 1, LaPool enables a better representation of molecular graphs, given that its data-driven dynamic segmentation is closely linked to chemical fragmentation [Gordon et al., 2011]. It is also the first GNN method to directly address the issue of model interpretability for sparse graphs.
Table 1: Comparison of LaPool with existing pooling methods.

| Property | Junction Tree | Graph U-Net | DiffPool | LaPool |
|---|---|---|---|---|
| Uses graph structure | ✓ |  | ✓ | ✓ |
| Data-driven |  | ✓ | ✓ | ✓ |
| Dynamic nb. of clusters | ✓ |  |  | ✓ |
| Interpretable | ✓ | ✓ |  | ✓ |
1.1 Related Work
In this section, we introduce related work on graph convolutions (GC) and graph pooling, then provide an overview of techniques used in molecular screening and generation.
Increased interest in Graph Neural Networks (GNNs) has resulted in a variety of networks being proposed recently [Gilmer et al., 2017, Henaff et al., 2015, Xu et al., 2018, Klicpera et al., 2018, Hamilton et al., 2017]. As our focus herein is on graph pooling, we refer the readers to [Wu et al., 2019] which reviews recent progress in the field and provides further connection with graph signal processing.
Virtual High-Throughput Screening (VHTS) aims to accurately predict molecular properties directly from molecular structure. It can thus play an important role in the early stages of drug discovery by rapidly triaging the most promising compounds for any given indication, and can further assist during lead compound optimization [Subramaniam et al., 2008]. Importantly, data-driven VHTS approaches that leverage recent advances in deep learning, rather than predetermined features such as molecular fingerprints [Rogers and Hahn, 2010] and string representations, have been shown to dramatically improve prediction accuracy [Kearnes et al., 2016, Wu et al., 2018].
Advances in generative models for molecular graphs were enabled by deep generative techniques such as variational autoencoders (VAE) [Kingma and Welling, 2013], generative adversarial networks (GAN) [Goodfellow et al., 2014], and adversarial autoencoders (AAE) [Makhzani et al., 2015]. The first molecular generative models (e.g. GrammarVAE [Kusner et al., 2017]) resorted to generating string representations of molecules (via SMILES), which resulted in many invalid structures due to the complex syntax of SMILES. Graph generative models have since been developed (e.g. JT-VAE [Jin et al., 2018], GraphVAE [Simonovsky and Komodakis, 2018], MolGAN [De Cao and Kipf, 2018], MolMP [Li et al., 2018b]) and have been shown to improve the validity and novelty of generated molecules. In addition, these methods allow conditional molecule generation via Bayesian optimization or reinforcement learning [Jin et al., 2018, Olivecrona et al., 2017, Assouel et al., 2018, Li et al., 2018d, You et al., 2018a].

Following the recent success of graph neural networks, Graph Pooling (GP) methods have been proposed to reduce graph size and increase the receptive field of nodes without increasing network depth. Contrary to the regular structure of images, graphs are irregular and complex, making it challenging to properly pool nodes together. Some graph pooling methods therefore rely on deterministic and non-differentiable clustering to segment the graph [Defferrard et al., 2016, Jin et al., 2018]. In contrast, a differentiable pooling layer (DiffPool) was proposed in [Ying et al., 2018] to perform a similarity-based node clustering using an affinity matrix learned by a GNN, while [Gao and Ji, 2018] proposed Graph U-Net, a sampling method that retains a subset of the nodes at each pooling step but remains differentiable.

2 Graph Laplacian Pooling
A reliable pooling operator should maintain the overall structure and connectivity of a graph. LaPool achieves this by taking into account the local structure defined by the neighborhood of each node. As shown in Figure 1, the method uses a standard GC layer with a centroid selection and a follower selection step. First, the centroids of the graph are selected based on the local signal variation (see Section 2.2). Next, LaPool learns an affinity matrix using a distance normalized attention mechanism to assign all nodes of the graph to the centroids (see Section 2.3). Finally, the affinity matrix allows for coarsening the graph into a smaller one. These steps are detailed below.
2.1 Preliminaries
Notation
Let $G = (V, E)$ be an undirected graph, where $V = \{v_1, \ldots, v_n\}$ is its vertex set, $A \in \mathbb{R}^{n \times n}$ denotes its adjacency matrix, and $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$ is the node feature matrix, with each node $v_i$ having a $d$-dimensional feature $x_i$. The features $X$ can also be viewed as a $d$-dimensional signal on $G$ [Shuman et al., 2012]. Without loss of generality, we may assume a fixed ordering of the nodes that is respected in $V$, $A$, and $X$.
Graph Signal
For any graph $G$, its unnormalized graph Laplacian matrix is defined as $L = D - A$, where $D$ is a diagonal matrix with $D_{ii} = d_i$ being the degree of node $v_i$. The graph Laplacian is a difference operator and can be used to define the smoothness (the extent to which the signal changes between connected nodes) of a signal on $G$. For a 1-dimensional signal $f \in \mathbb{R}^n$:

$$f^\top L f = \frac{1}{2} \sum_{i,j} A_{ij} (f_i - f_j)^2 \qquad (1)$$
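The identity in Eq. (1) can be checked numerically on a small graph; a minimal numpy sketch (the path graph and signal below are illustrative choices):

```python
import numpy as np

# Small path graph 0-1-2-3 with unit edge weights.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # unnormalized graph Laplacian

f = np.array([0.0, 1.0, 1.0, 4.0])  # a 1-dimensional signal on the nodes

quad = f @ L @ f                    # quadratic form f^T L f
edge_sum = 0.5 * np.sum(A * (f[:, None] - f[None, :]) ** 2)

print(quad, edge_sum)  # both equal 1 + 0 + 9 = 10
```

Both quantities agree: the Laplacian quadratic form accumulates the squared signal differences across edges, so a smooth signal (small differences between connected nodes) yields a small value.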
Graph Neural Networks
We consider GNNs that act in the graph spatial domain as message passing [Gilmer et al., 2017]. We focus on the Graph Isomorphism Network (GIN) [Xu et al., 2018], which uses a SUM-aggregator on the messages received by each node to achieve a better understanding of the graph structure:

$$h_v^{(l+1)} = M_{\theta}^{(l)}\Big( h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} A_{u,v}\, h_u^{(l)} \Big) \qquad (2)$$

where $M_{\theta}^{(l)}$ is a neural network with trainable parameters $\theta$, $h_v^{(l)}$ is the feature vector for node $v$, $\mathcal{N}(v)$ are the neighbours of $v$, and $l$ is the layer number. Notice the term $A_{u,v}$ that takes into account the edge weight between nodes $u$ and $v$ when $A$ is not binary.

In this work, we focus on molecular graphs and mostly place ourselves in a supervised setting where, given a molecule $m$ and its corresponding molecular graph $G$, we aim to predict some properties of $m$. Molecular graphs present two particularities: (1) they are often sparse and (2) there is no regularity in the graph signal (non-smooth variation), as adjacent nodes tend not to have similar features.
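The update of Eq. (2) can be sketched in a few lines of numpy. This is an illustrative single layer, with one linear map plus ReLU standing in for the trainable network and random weights in place of learned ones — not the authors' implementation:

```python
import numpy as np

def gin_layer(A, H, W):
    """One GIN-style update (Eq. 2): sum-aggregate the (edge-weighted)
    neighbour messages, add the node's own feature, then apply a linear
    layer with ReLU standing in for the trainable MLP."""
    agg = H + A @ H                 # A carries edge weights when not binary
    return np.maximum(agg @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1], [1, 0]], dtype=float)  # two connected nodes
H = rng.normal(size=(2, 4))                  # node features, d = 4
W = rng.normal(size=(4, 8))                  # stand-in for trainable weights
print(gin_layer(A, H, W).shape)  # -> (2, 8)
```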
2.2 Graph Downsampling via BandPass Filtering
This section details how LaPool downsamples the original graph by selecting a set of nodes as centroids after consecutive GC layers.
Centroid Selection
For any given vertex $v_i$, we can define a local measure of signal intensity variation around $v_i$:

$$s_i = \big\| (LX)_i \big\|_2 = \Big\| d_i\, x_i - \sum_{j \in \mathcal{N}(i)} A_{ij}\, x_j \Big\|_2 \qquad (3)$$

As $s_i$ measures how different a node is from the average of its neighbours, we are interested in the set of nodes with the highest $s_i$, corresponding to the high frequencies of the signal.
Observe that the GC layers preceding each pooling step perform a smoothing of the graph signal and thus act as a low-pass filter. Combined with the high-pass filter of Eq. (3), this results in a band-pass filtering of the signal that attenuates low- and high-frequency noise but retains the important signal in the medium frequencies. The intuition behind using the Laplacian maxima to select the centroids is that a smooth signal can be very well approximated by a linear interpolation between its local maxima and minima. This is in contrast with most approaches in GSP, which use the lower frequencies for signal conservation but require the signal to be k-bandlimited [Ma et al., 2019, Chen et al., 2015b, a]. For a 1D signal, LaPool selects points, usually near the maxima/minima, where the derivative changes the most and is hardest to interpolate linearly (see Appendix C for further details). For molecular graphs, this corresponds to sampling a subset of nodes that are critical for reconstructing the original molecule.

Dynamic Selection of the Centroids
The method presented in Eq. (3) implies the selection of a fixed number $k$ of centroids. In contrast with other methods [Ying et al., 2018, Gao and Ji, 2018], we do not use a fixed value of $k$ because the optimal value can be graph-dependent and a fixed choice might result in densely located centroids. Instead, we dynamically choose the centroid set $V_L$ by selecting the nodes where the signal variation is greater than at all of their neighbours:

$$V_L = \big\{ v_i \in V \;\big|\; \forall\, v_j \in \mathcal{N}(i),\; s_i > s_j \big\} \qquad (4)$$
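The centroid selection can be sketched as follows. This is a hypothetical numpy illustration (score each node by the norm of its Laplacian-filtered features, then keep strict local maxima of that score); function and variable names are our own, not the reference implementation:

```python
import numpy as np

def select_leaders(A, X):
    """Score nodes by the norm of the Laplacian-filtered signal, then
    keep nodes whose score strictly exceeds that of every neighbour."""
    L = np.diag(A.sum(axis=1)) - A
    s = np.linalg.norm(L @ X, axis=1)   # local signal variation per node
    leaders = [i for i in range(len(s))
               if all(s[i] > s[j] for j in np.nonzero(A[i])[0])]
    return np.array(leaders), s

# Path graph 0-1-2 where the middle node differs most from its neighbours.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[0.0], [2.0], [0.0]])
leaders, s = select_leaders(A, X)
print(leaders)  # -> [1]
```

Note that the number of leaders is not fixed in advance: it emerges from how many strict local maxima the signal variation has on the given graph.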
2.3 Learning the NodetoCluster Assignment Matrix
Once the set of centroid nodes $V_L$ is determined, we compute a mapping of the remaining "follower" nodes $V \setminus V_L$ into the new clusters formed by the nodes in $V_L$. This mapping gives the cluster assignment matrix $C \in \mathbb{R}^{n \times m}$ with $m = |V_L|$, where each row $C_i$ corresponds to the affinity of node $v_i$ towards each of the $m$ clusters in $V_L$.

Let $X$ be the node embedding matrix at an arbitrary layer and $X_{V_L}$ the embedding of the "centroids". We compute $C$ using a soft-attention mechanism [Weng, 2018] measured by the cosine similarity between $X$ and $X_{V_L}$:

$$C_{ij} = \begin{cases} \delta_{v_i,\, L_j} & \text{if } v_i \in V_L \\[4pt] \operatorname{sparsemax}_j\!\Big( \beta_{ij}\, \dfrac{x_i^\top x_{L_j}}{\lVert x_i \rVert\, \lVert x_{L_j} \rVert} \Big) & \text{otherwise} \end{cases} \qquad (5)$$

where $\delta$ is the Kronecker delta and sparsemax is an alternative to the softmax operator [Laha et al., 2016, Martins and Astudillo, 2016] that ensures the sparsity of the attention coefficients and encourages the assignment of each node to a single centroid. This alleviates the need for the entropy minimization performed by DiffPool.

Eq. (5) also prevents the selected centroid nodes from being assigned to other clusters. Moreover, notice the term $\beta_{ij}$ that regularizes the value of the attention for each node. We can define $\beta_{ij}$ as the inverse of $d(v_i, L_j)$, the shortest-path distance between node $v_i$ and the centroids in $V_L$. Although this regularization incurs an additional cost, it strengthens the affinity to closer centroids.
We explored alternatives without this regularization or by restricting the mapping of each follower to centroids within a fixed khop neighborhood. The results for the various alternatives are presented jointly in Section 3.1.
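Since sparsemax drives the sparsity of the assignment, a minimal sketch of the operator may help. Following Martins and Astudillo [2016], it is the Euclidean projection onto the probability simplex; the code below is an illustrative implementation, not the one used in the paper:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.
    Unlike softmax, it can drive small entries exactly to zero."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = ks[1 + ks * z_sorted > cumsum]
    k = support[-1]                    # size of the support
    tau = (cumsum[k - 1] - 1) / k      # threshold subtracted from z
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.1, 0.1]))
print(p)  # -> [0.95 0.05 0.  ]
```

The output still sums to one, but the weakest logit is driven exactly to zero, which is what encourages each follower node to attach to a single centroid.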
Finally, after $C$ is computed, the coarsened graph $G' = (A', X')$ is computed using Eq. (6), as in [Ying et al., 2018]. In these equations, $M_\theta$ is a neural network with trainable parameters $\theta$ that is used to update the embedding of nodes in $G'$ after the mapping.

$$X' = M_\theta\big(C^\top X\big), \qquad A' = C^\top A\, C \qquad (6)$$
This process can be repeated by feeding the new graph into another GNN layer.
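The coarsening step of Eq. (6) reduces to two matrix products; a minimal sketch with an identity map standing in for the trainable network, on a hand-built one-hot assignment:

```python
import numpy as np

def coarsen(A, X, C):
    """Pool features and adjacency through the cluster assignment C (n x m),
    with an identity map standing in for the trainable update network."""
    X_new = C.T @ X      # cluster features: weighted sums of member nodes
    A_new = C.T @ A @ C  # connectivity between the new clusters
    return A_new, X_new

# Four nodes, two clusters: {0, 1} -> cluster 0, {2, 3} -> cluster 1.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)
C = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

A_new, X_new = coarsen(A, X, C)
print(A_new.shape, X_new.shape)  # -> (2, 2) (2, 2)
```

Because each row of the assignment sums to one, column sums of the features are preserved through the pooling, which is the property exploited by Proposition 1 below.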
2.4 Properties of the LaPool Method
Permutation Invariance
Since both the centroid selection and the node-to-cluster assignment depend only on the graph structure and its signal, and not on any particular node ordering, the coarsened graph produced by LaPool is invariant under node permutation.
Information Preservation
The centroid selection and the sparse, distance-regularized node mapping ensure an appropriate segmentation of the graph and prevent the coarsened graph from being fully connected. Although LaPool enforces a hierarchical structure on the graph, it still preserves the information of the original graph after a global sum-pooling (GSumPool).
Proposition 1.
Define the structure-aware feature content of a graph $G$ as $\Phi(G) = \mathbf{1}_n^\top X = \sum_{v_i \in V} x_i$, where $X$ is the node embedding produced by the GC layers. Ignoring the feature update performed by LaPool, GSumPool($G'$) = GSumPool($G$).
Proof.
Every row of $C$ sums to one: follower rows are sparsemax outputs, which lie on the probability simplex, and centroid rows are one-hot by the Kronecker delta in Eq. (5). Hence $C\mathbf{1}_m = \mathbf{1}_n$, and
$$\mathrm{GSumPool}(G') = \mathbf{1}_m^\top C^\top X = (C\mathbf{1}_m)^\top X = \mathbf{1}_n^\top X = \mathrm{GSumPool}(G).$$
∎
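The claim can also be checked numerically for any row-stochastic assignment matrix; a small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 6, 2, 3

# Random row-stochastic assignment: each row of C sums to 1, as guaranteed
# by sparsemax (followers) and one-hot rows (centroids).
C = rng.random((n, m))
C /= C.sum(axis=1, keepdims=True)

X = rng.normal(size=(n, d))

before = X.sum(axis=0)             # global sum-pool of the original graph
after = (C.T @ X).sum(axis=0)      # global sum-pool after coarsening

print(np.allclose(before, after))  # -> True
```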
Emphasizing the Strong Features
Similar to how most CNNs implement a max-pooling layer to emphasize the strong features, LaPool does so by selecting the nodes with high Laplacian variation as leaders. For molecular graphs, the leaders are biased towards high-degree nodes and atoms that differ from their neighbours (e.g. a nitrogen atom in a carbon ring).

3 Results and Discussion
A fundamental objective of LaPool is to learn an interpretable representation of sparse graphs, notably molecular substructures. We argue that this is an essential step towards building neural network models that adequately represent the distribution of molecular data. Indeed, beyond purely discriminative ability, a generative graph network should be able to reconstruct molecular graphs from semantically important substructure components. This stems from the intuition that molecular validity and functional properties derive more from chemical fragments than individual atoms.
Our experimental results thus aim to empirically demonstrate the following properties of LaPool, as benchmarked against current state-of-the-art pooling models and the Graph Isomorphism Network.

LaPool’s consideration of semantically important information such as node distance translates to improved performance on molecule substructure prediction tasks

Visualization of LaPool’s behaviour at the pooling layer demonstrates its ability to identify coherent and meaningful molecular substructures

A more coherent pooling layer may lead to better results for supervised tasks such as molecule toxicity prediction

Learning meaningful substructures can be leveraged to construct a generative model which leads to more realistic and feasible molecules
We use the architecture depicted in Appendix A throughout our experiments (see Encoder's architecture). Furthermore, we note that minimal architectural tuning was performed, given that the objective of our experiments is to maintain an even comparison across pooling models. We thus maintained an even network capacity across models, instead performing hyperparameter tuning on pooling-specific variables. Specifically, we optimized over the number of clusters for DiffPool and Graph U-Net, and over the Laplacian regularization and node-neighbourhood parameters for LaPool.
3.1 Substructure Prediction
While DiffPool and Graph U-Net outperform standard graph convolution networks on supervised tasks over dense graphs [Ying et al., 2018, Gao and Ji, 2018], we expect them to be ineffective at identifying important substructures in sparse graphs, as they do not explicitly consider structural relationships. We demonstrate this empirically by extracting known molecular substructure information from the publicly available Tox21 and ChEMBL datasets [Council et al., 2007, Gaulton et al., 2011] and evaluating performance in identifying these structures.¹ For the ChEMBL dataset, we use the same subset of approximately 17,000 molecules previously used in Li et al. [2018c] for kinase activity prediction.

¹These datasets and the full source code for LaPool and all experiments are available at (URL to appear for camera-ready version).
As shown in Tables 2 and 3, capturing these structural relationships translates to superior performance of LaPool, as measured across standard metrics on various substructure prediction tasks. We benchmark on different types of substructures and across datasets to verify the robustness of this comparison. We find that for predicting the presence of both 86 molecular fragments arising purely from structural information and 55 structural alerts associated with molecule toxicity, LaPool globally outperforms the other pooling models and a baseline GIN on the F1 (micro/macro averaged) and ROC-AUC metrics. The different versions of LaPool in the results correspond to the regularization options for the cluster assignment described in Section 2.3. We note that the distance-regularized version of LaPool yields consistently high performance, suggesting that a stronger affinity towards closer centroids often translates into an improved graph representation.
Table 2: Molecular fragment prediction (86 fragments).

| Model | Tox21 F1-macro | Tox21 F1-micro | Tox21 ROC-AUC | ChEMBL F1-macro | ChEMBL F1-micro | ChEMBL ROC-AUC |
|---|---|---|---|---|---|---|
| GIN | 79.6 | 83.5 | 94.5 | 88.2 | 96.8 | 91.8 |
| DiffPool | 79.3 | 80.9 | 93.9 | 86.6 | 95.6 | 90.9 |
| Graph U-Net | 72.1 | 72.3 | 88.3 | 77.3 | 87.0 | 82.9 |
| LaPool | 81.6 | 85.4 | 95.0 | 89.0 | 97.1 | 91.9 |
| LaPool | 80.3 | 84.2 | 95.1 | 88.8 | 96.6 | 92.4 |
| LaPool | 80.7 | 86.1 | 95.1 | 87.7 | 96.4 | 92.2 |
Table 3: Structural alert prediction (55 toxicity alerts).

| Model | Tox21 F1-macro | Tox21 F1-micro | Tox21 ROC-AUC | ChEMBL F1-macro | ChEMBL F1-micro | ChEMBL ROC-AUC |
|---|---|---|---|---|---|---|
| GIN | 78.9 | 68.3 | 72.6 | 93.6 | 76.7 | 59.2 |
| DiffPool | 79.2 | 68.0 | 75.6 | 94.5 | 83.3 | 59.3 |
| Graph U-Net | 71.1 | 47.6 | 67.9 | 92.9 | 68.1 | 59.3 |
| LaPool | 80.6 | 74.2 | 73.5 | 95.2 | 81.3 | 59.5 |
| LaPool | 81.3 | 72.8 | 74.1 | 94.1 | 75.8 | 58.9 |
| LaPool | 79.1 | 71.6 | 74.8 | 93.8 | 75.0 | 59.1 |
3.2 Model Interpretability
To better understand the insights provided by LaPool, we investigate the behaviour of the network by plotting the clustering made at the pooling layer level. We believe this provides further insight by highlighting the improved explainability of LaPool compared to other pooling models on fragment prediction tasks.
By analyzing the relationship between the pooling layer used and the vertex-cluster attention, we may better understand why LaPool's pooling is preferable to current methods for identifying meaningful clusters. While defining what is meaningful is inherently subjective, we attempt to shed light on these models by observing their behaviour in the chemical domain, using our understanding of chemical structure as reference. We focus on DiffPool, since it is the most similar method, and also because the node sampling performed by Graph U-Net ignores the graph structure, usually disconnecting the coarsened graph.
In general, we show in Figure 2 that LaPool is able to coarsen the molecular graphs into robust, sparsely connected graphs, which can be interpreted as the skeletons of the molecules. In contrast, DiffPool's cluster assignment is much more uniform across the graph, leading to densely connected coarsened graphs that are less interpretable from a chemical viewpoint. Example (c) shows how DiffPool creates a fully connected graph from an originally disconnected graph, and example (b) shows that symmetric elements, despite being far from each other, are assigned identically. Such failures are not present for the proposed LaPool model. A typical failure case for LaPool is seen in (e) and corresponds to a missing leader node in a given region of the graph, which results in a soft assignment of the region to multiple clusters. In contrast, this behaviour appears in most DiffPool samples, since its fixed number of clusters and its inability to consider node distance cannot account for the diversity of the molecular datasets.
3.3 Toxicity Prediction
In addition to evaluating structural understanding of the pooling models, we benchmark our model on molecular toxicity prediction using the Tox21 dataset. As shown in Table 4, we demonstrate that the improved structural interpretation of LaPool’s pooling mechanism may also lead to an improvement in molecular property prediction, a key performance metric for molecular optimization.
Table 4: Toxicity prediction on Tox21.

| Model | F1-macro | F1-micro | ROC-AUC |
|---|---|---|---|
| GIN | 59.0 | 24.5 | 76.6 |
| DiffPool | 59.7 | 24.3 | 79.9 |
| Graph U-Net | 56.7 | 18.1 | 75.3 |
| LaPool | 61.7 | 31.0 | 80.7 |
| LaPool | 59.9 | 25.4 | 80.7 |
| LaPool | 59.9 | 26.2 | 81.8 |
3.4 Molecular Generation
We showcase LaPool's utility in drug discovery by demonstrating that it can be leveraged to generate molecules. In previous work, GANs and VAEs were used to generate either string representations or molecular graphs. Here, we use the GAN-based Wasserstein Auto-Encoder recently proposed in [Tolstikhin et al., 2017] to model the data distribution of molecules (see Figure B.4 in Appendix). For the encoder, we use a network architecture similar to that of our supervised experiments. The decoder and discriminator are simple MLPs, with complete architecture details provided in Appendix A.4. Even though the encoder is permutation invariant, the decoding process might not be. In our particular case, we use a canonicalization algorithm [Schneider et al., 2015] that reorders atoms to ensure a unique graph for each molecule, thus forcing the decoder to learn a single graph ordering. Nevertheless, we further improve the robustness of our generative model to node permutations by computing the reconstruction loss on a permutation-invariant embedding, parameterized by a GIN, of both the input and reconstructed graphs (see Appendix A.4.2). We find that such a formulation improves the reconstruction loss and increases the ratio of valid molecules generated.
Dataset and Baseline Models
Following previous work on molecular generation, we evaluate our generative model with an encoder enhanced by the LaPool layer (referred to as WAE-LaP) on the QM9 molecular dataset [Ramakrishnan et al., 2014]. This dataset contains 133,885 small drug-like organic compounds with up to 9 heavy atoms (C, O, N, F). We compare WAE-LaP to alternatives within our WAE framework where either no pooling is used (WAE-GNN) or DiffPool is used as the pooling layer (WAE-Diff). Our results are also compared to previous results on the same dataset, including GrammarVAE, GraphVAE, and MolGAN.
Evaluation Metrics
We measure the performance of the generative model using metrics standard in the field: validity (proportion of valid molecules from generated samples), uniqueness (proportion of unique molecules generated), and novelty (proportion of generated samples not found in the training set). All metrics were computed on a set of 10,000 generated molecules.
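These three metrics can be sketched as follows. The function and its arguments are hypothetical names of our own, and a real validity check would require a chemistry toolkit (e.g. an RDKit parse), which is stubbed out as a caller-supplied predicate here; conventions for computing uniqueness/novelty over valid samples vary, and this sketch shows one of them:

```python
def generation_metrics(samples, training_set, is_valid):
    """Validity, uniqueness, and novelty for a batch of generated
    molecules, given as canonical strings plus a validity predicate."""
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)
    validity = len(valid) / len(samples)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = (sum(1 for s in unique if s not in training_set) / len(unique)
               if unique else 0.0)
    return validity, uniqueness, novelty

samples = ["CCO", "CCO", "CCN", "??"]   # toy canonical strings
train = {"CCO"}
v, u, n = generation_metrics(samples, train, lambda s: "?" not in s)
print(v, u, n)  # -> 0.75, 2/3, 0.5
```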
Table 5: Molecular generation on QM9 (computed over 10,000 samples).

| Metric | GrammarVAE | GraphVAE | MolGAN | WAE-GNN | WAE-Diff | WAE-LaP |
|---|---|---|---|---|---|---|
| % Valid | 60.2 | 91.0 | 98.1 | 96.8 | 97.2 | 98.8 |
| % Unique | 9.3 | 24.1 | 10.4 | 50.0 | 29.3 | 65.5 |
| % Novel | 80.9 | 61.0 | 94.2 | 78.9 | 78.9 | 78.4 |
As shown in Table 5, WAE-LaP generated the most valid and unique molecules, while MolGAN performed best on the novelty metric. Moreover, as we show in Appendix A.4, LaPool enables the generation of a more diverse set of molecules compared to DiffPool and a plain GNN. We observe that all WAE-based methods produced similar proportions of novel molecules, suggesting that combining LaPool with other generative approaches could improve the uniqueness and validity of generated compounds. We therefore argue that the pooling performed by LaPool can improve molecular graph representation, which is crucial in a generative setting.
4 Conclusion
Building on the literature of graph signal processing, we have derived LaPool, a novel, differentiable, and robust pooling operator for sparse molecular graphs. We have shown that LaPool considers both node information and graph structure during the graph coarsening process. By incorporating the proposed pooling layer into existing graph neural networks, we have demonstrated that the enforced hierarchization allows the resulting network to capture a richer and more relevant set of features in the graph-level representation. We discussed the performance of LaPool relative to existing graph pooling layers and demonstrated on three molecular classification benchmarks that LaPool outperforms existing graph pooling modules and produces more interpretable results. In particular, we argue that the molecular graph segmentation performed by LaPool provides greater insight into molecular activity and the associated properties that can be leveraged in drug discovery. Finally, we briefly highlighted how this new pooling layer can be used to facilitate de novo molecular design. In future work, we aim to further investigate the link between the steps performed by LaPool and the spectral domain of the graph, and how additional sources of information such as edge features could be incorporated into the process. Moreover, although we focused on molecular graphs, it would be of interest to evaluate the performance of LaPool on dense or highly structured graphs.
References
 Assouel et al. [2018] Rim Assouel, Mohamed Ahmed, Marwin H. Segler, Amir Saffari, and Yoshua Bengio. DEFactor: Differentiable Edge Factorization-based Probabilistic Graph Generation. arXiv:1811.09766 [cs], November 2018. URL http://arxiv.org/abs/1811.09766. arXiv: 1811.09766.
 Chen et al. [2015a] Siheng Chen, Rohan Varma, Aliaksei Sandryhaila, and Jelena Kovačević. Discrete Signal Processing on Graphs: Sampling Theory. IEEE Transactions on Signal Processing, 63(24):6510–6523, December 2015a. ISSN 1053587X, 19410476. doi: 10.1109/TSP.2015.2469645. URL http://arxiv.org/abs/1503.05432. arXiv: 1503.05432.
 Chen et al. [2015b] Siheng Chen, Rohan Varma, Aarti Singh, and Jelena Kovačević. Signal Representations on Graphs: Tools and Applications. arXiv:1512.05406 [cs, math], December 2015b. URL http://arxiv.org/abs/1512.05406. arXiv: 1512.05406.
 Council et al. [2007] National Research Council et al. Toxicity testing in the 21st century: a vision and a strategy. National Academies Press, 2007.
 Dafna and Guestrin [2009] Shahaf Dafna and Carlos Guestrin. Learning Thin Junction Trees via Graph Cuts. In Artificial Intelligence and Statistics, pages 113–120, April 2009. URL http://proceedings.mlr.press/v5/dafna09a.html.
 De Cao and Kipf [2018] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
 Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
 Gao and Ji [2018] Hongyang Gao and Shuiwang Ji. Graph U-Net. Under review, September 2018. URL https://openreview.net/forum?id=HJePRoAct7.
 Gaulton et al. [2011] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2011.
 Gilmer et al. [2017] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural Message Passing for Quantum Chemistry. arXiv:1704.01212 [cs], April 2017. URL http://arxiv.org/abs/1704.01212. arXiv: 1704.01212.
 Goodfellow et al. [2014] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 Gordon et al. [2011] Mark S Gordon, Dmitri G Fedorov, Spencer R Pruitt, and Lyudmila V Slipchenko. Fragmentation methods: A route to accurate calculations on large systems. Chemical reviews, 112(1):632–672, 2011.
 Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 Henaff et al. [2015] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep Convolutional Networks on GraphStructured Data. arXiv:1506.05163 [cs], June 2015. URL http://arxiv.org/abs/1506.05163. arXiv: 1506.05163.
 Jin et al. [2018] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction Tree Variational Autoencoder for Molecular Graph Generation. arXiv:1802.04364 [cs, stat], February 2018. URL http://arxiv.org/abs/1802.04364. arXiv: 1802.04364.
 Kadurin et al. [2017] Artur Kadurin, Sergey Nikolenko, Kuzma Khrabrov, Alex Aliper, and Alex Zhavoronkov. druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Molecular pharmaceutics, 14(9):3098–3104, 2017.
 Kearnes et al. [2016] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computeraided molecular design, 30(8):595–608, 2016.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Klicpera et al. [2018] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. 2018.

 Kusner et al. [2017] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1945–1954. JMLR.org, 2017.
 Laha et al. [2016] Anirban Laha, Saneem Ahmed Chemmengath, Priyanka Agrawal, Mitesh Khapra, Karthik Sankaranarayanan, and Harish G Ramaswamy. On Controllable Sparse Alternatives to Softmax. page 11, February 2016.
 Li et al. [2018a] Qimai Li, Zhichao Han, and XiaoMing Wu. Deeper Insights into Graph Convolutional Networks for SemiSupervised Learning. arXiv:1801.07606 [cs, stat], January 2018a. URL http://arxiv.org/abs/1801.07606. arXiv: 1801.07606.
 Li et al. [2018b] Yibo Li, Liangren Zhang, and Zhenming Liu. Multiobjective de novo drug design with conditional graph generative model. Journal of Cheminformatics, 10(1):33, dec 2018b. ISSN 17582946. doi: 10.1186/s1332101802876. URL https://jcheminf.springeropen.com/articles/10.1186/s1332101802876.
 Li et al. [2018c] Yibo Li, Liangren Zhang, and Zhenming Liu. Multiobjective de novo drug design with conditional graph generative model. Journal of cheminformatics, 10(1):33, 2018c.
 Li et al. [2018d] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning Deep Generative Models of Graphs. 2018d. ISSN 23268298. doi: 10.1146/annurevstatistics010814020120. URL http://arxiv.org/abs/1803.03324.
 Ma et al. [2019] Yao Ma, Suhang Wang, Charu C. Aggarwal, and Jiliang Tang. Graph Convolutional Networks with EigenPooling. arXiv:1904.13107 [cs, stat], April 2019. URL http://arxiv.org/abs/1904.13107. arXiv: 1904.13107.
 Makhzani et al. [2015] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Martins and Astudillo [2016] André F. T. Martins and Ramón Fernandez Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and MultiLabel Classification. arXiv:1602.02068 [cs, stat], February 2016. URL http://arxiv.org/abs/1602.02068. arXiv: 1602.02068.
Olivecrona et al. [2017] Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48, September 2017. ISSN 1758-2946. doi: 10.1186/s13321-017-0235-x. URL https://doi.org/10.1186/s13321-017-0235-x.
Polykovskiy et al. [2018] Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, Artur Kadurin, Sergey Nikolenko, Alan Aspuru-Guzik, and Alex Zhavoronkov. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. arXiv preprint arXiv:1811.12823, 2018.
 Ramakrishnan et al. [2014] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data, 1:140022, 2014.
Rogers and Hahn [2010] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
Schneider et al. [2015] Nadine Schneider, Roger A Sayle, and Gregory A Landrum. Get your atoms in order - an open-source implementation of a novel and robust molecular canonicalization algorithm. Journal of chemical information and modeling, 55(10):2111–2120, 2015.
Shuman et al. [2012] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053, 2012.
 Simonovsky and Komodakis [2018] Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018.
Subramaniam et al. [2008] Sangeetha Subramaniam, Monica Mehrotra, and Dinesh Gupta. Virtual high throughput screening (vHTS) - a perspective. Bioinformation, 3(1):14, 2008.
 Tolstikhin et al. [2017] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558, 2017.
 von Luxburg [2007] Ulrike von Luxburg. A Tutorial on Spectral Clustering. arXiv:0711.0189 [cs], November 2007. URL http://arxiv.org/abs/0711.0189. arXiv: 0711.0189.
Weng [2018] Lilian Weng. Attention? Attention!, 2018. URL https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html.
 Wu et al. [2018] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
 Wu et al. [2019] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
 Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? arXiv:1810.00826 [cs, stat], October 2018. URL http://arxiv.org/abs/1810.00826. arXiv: 1810.00826.
Ying et al. [2018] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical Graph Representation Learning with Differentiable Pooling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 4805–4815, Montréal, Canada, 2018. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=3327345.3327389.
You et al. [2018a] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, pages 6410–6421, 2018a.
You et al. [2018b] Jiaxuan You, Rex Ying, Xiang Ren, William L Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. arXiv preprint arXiv:1802.08773, 2018b.
Appendix A Architecture search and hyperparameter selection
Below, we describe the network architecture and the training process used for the supervised and generative experiments.
A.1 Edge attributes
Some of the work presented assumes the absence of edge attributes in the graphs. However, in molecular graphs, the nature of the bond between two atoms plays an important role in activity and property prediction. As such, edge types should be considered in the supervised models. To take this into consideration, we add to our network an initial Edge GC layer that explicitly takes edge attributes into account. Let $G = (V, E)$ be an undirected molecular graph, with $n = |V|$ the number of nodes in the graph and $e$ the number of possible edge types. We have that
$$A \in \{0, 1\}^{n \times n \times e} \tag{7}$$
where $A$ is the edge-typed adjacency tensor of the graph, with $A_i = A_{:,:,i}$ the adjacency matrix restricted to edges of type $i$.
The Edge GC layer is defined as follows:
$$X' = \big\Vert_{i=1}^{e} \, f_i(A_i, X) \tag{8}$$
where $\Vert$ is the concatenation operator on the node feature dimension and the $f_i$ are graph neural networks parameterized to learn different features for each edge type. A new graph defined as $G' = (\bar{A}, X')$, with $\bar{A} = \sum_i A_i$ the collapsed adjacency matrix, can then be fed into the subsequent layers of the network.
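As an illustration, the per-edge-type convolution and concatenation of Eq. (8) can be sketched in NumPy. This is a minimal sketch, not the reference implementation: a mean-aggregation linear GC stands in for the parameterized per-edge-type networks, and the function name and shapes are assumptions.

```python
import numpy as np

def edge_gc(adj_per_type, X, weights):
    """Illustrative edge-aware graph convolution: one linear GC per edge
    type, with outputs concatenated on the feature dimension.
    `adj_per_type` has shape (e, n, n), `X` is (n, d), and `weights` is a
    list of e matrices of shape (d, d_out)."""
    outputs = []
    for A_i, W_i in zip(adj_per_type, weights):
        A_hat = A_i + np.eye(A_i.shape[0])      # add self-loops
        deg = A_hat.sum(axis=1, keepdims=True)  # node degrees (>= 1)
        H_i = (A_hat / deg) @ X @ W_i           # mean-aggregate, then project
        outputs.append(np.maximum(H_i, 0.0))    # ReLU activation
    return np.concatenate(outputs, axis=1)      # shape (n, e * d_out)
```

With $e$ edge types and $d_{out}$ output channels per type, each node ends up with an $e \cdot d_{out}$-dimensional feature vector.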
A.2 Molecular node and edge attributes
In our experiments, the initial node feature tensor is represented by a one-hot encoding over the 50 atom types present in the datasets (ignoring hydrogens, which were removed), augmented with additional properties such as the atom's implicit valence, its formal charge, its number of radical electrons, and whether it is part of a molecular ring. This results in a fixed-size feature vector for each atom. For edge attributes, we consider single, double and triple bonds, which were enough to cover all molecules of the datasets. Finally, we kekulize the molecules.
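A minimal sketch of this featurization, assuming the atom symbols and properties have already been extracted upstream (e.g. with RDKit). The vocabulary below is deliberately truncated for illustration; the paper's version covers 50 atom types.

```python
import numpy as np

# Hypothetical, truncated atom vocabulary (the actual one has 50 entries).
ATOM_VOCAB = ["C", "N", "O", "F", "S", "Cl"]

def atom_features(symbol, implicit_valence, formal_charge,
                  num_radical_electrons, in_ring):
    """One-hot atom type plus scalar properties, mirroring the node
    featurization described above. Property extraction is assumed to
    happen upstream; this only assembles the vector."""
    one_hot = np.zeros(len(ATOM_VOCAB))
    if symbol in ATOM_VOCAB:
        one_hot[ATOM_VOCAB.index(symbol)] = 1.0
    extra = np.array([implicit_valence, formal_charge,
                      num_radical_electrons, float(in_ring)])
    return np.concatenate([one_hot, extra])
```

For example, an aromatic-ring nitrogen with one implicit hydrogen maps to a one-hot block plus the four scalar properties appended at the end.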
A.3 Supervised experiments
In all of our supervised experiments, we use a graph convolution module consisting of two graph convolutional layers of 64 channels each with identity connection and ReLU activation, followed by a graph pooling layer. This basic module is followed by two additional graph convolution layers and a global sum pooling to yield a graph-level representation (64). This is further followed by two fully connected layers (FCL) with 64 output channels each, finalized by an FCL output layer for the task readouts. Except for the output layer, for which a sigmoid activation function is used, we use the ReLU activation function for all layers.
For DiffPool, we performed a hyperparameter search to find the optimal number of clusters (3, 5, 7, 9). Similarly, a search was performed for the Graph U-net pooling layer to determine the best number of nodes to retain (3, 5, 7, 9). The grid values were set in a way that appropriately reflects the size of the molecules in the datasets.
For LaPool, we performed a grid search over the window size $k$ used as regularization to prevent nodes from mapping to centroids that are more than $k$ hops away, and over the power $v$ of the graph Laplacian used when computing the graph signal. This latter regularization increases the receptive field of each node when computing the signal variation: because the signal variation at the nodes can be defined as the row-wise norm of $L X$, by taking $L^v$ instead we can measure the signal variation at each node while considering nodes within its $v$-hop neighborhood. The grid search was performed over $k$ (where 0 means no regularization) and $v$.
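The Laplacian-based signal variation described above can be sketched as follows. This is an illustrative NumPy reimplementation using the combinatorial Laplacian, not the reference code; the parameter name `k` here denotes the Laplacian power that widens each node's receptive field.

```python
import numpy as np

def signal_variation(A, X, k=1):
    """Per-node signal variation via the graph Laplacian: the row-wise
    norm of L^k X. With k = 1 only immediate neighbours contribute;
    larger k considers nodes within a k-hop neighbourhood."""
    D = np.diag(A.sum(axis=1))
    L = D - A                               # combinatorial Laplacian
    V = np.linalg.matrix_power(L, k) @ X    # filtered signal
    return np.linalg.norm(V, axis=1)        # one variation score per node
```

On a 3-node path graph where only the middle node carries signal, the middle node receives the highest variation score, which is the behaviour the leader selection relies on.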
For the supervised experiments, we use a batch size of 32 and train the networks for 100 epochs.
A.4 Generative models
A.4.1 WAE model
We use a Wasserstein Auto-Encoder (WAE) as our generative model (see Figure A.3). The WAE minimizes a penalized form of the Wasserstein distance between a model distribution and a target distribution. It allows using any reconstruction cost function and was shown to improve learning stability.
As described in [Tolstikhin et al., 2017], we aim to minimize the following objective:
$$D_{WAE}(P_X, P_G) = \inf_{Q(Z|X) \in \mathcal{Q}} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)} \big[ c\big(X, G(Z)\big) \big] + \lambda \, D_{JS}(Q_Z, P_Z) \tag{9}$$
where $\mathcal{Q}$ is any nonparametric set of probabilistic encoders, $D_{JS}$ is the Jensen-Shannon divergence between the learned latent distribution $Q_Z$ and the prior $P_Z$, and $\lambda > 0$ is a hyperparameter. $D_{JS}$ is estimated using adversarial training (a discriminator).
For our generative model, the encoder follows a structure similar to the network used in our supervised experiments, except that the network now learns a continuous latent space given a set of input molecular graphs. More precisely, it consists of one edge graph layer, followed by two GCs (32 channels each), an optional graph pooling, two additional GC layers (64, 64), one global sum pooling step (128) and two FCLs (128), meaning the molecular graphs are embedded into a latent space of dimension 128. Following recent work on graph generation [You et al., 2018b, Assouel et al., 2018, Li et al., 2018d], we first tried to model the node/edge decoding with an autoregressive framework, aiming to better capture the interdependency between nodes and edges. Given the latent code $z$, such a decoding process iteratively generates a continuous embedding of the nodes using a recurrent neural network. However, our preliminary results suggested that this network converges slowly and did not yield any substantial improvement in reconstruction accuracy compared to a simple MLP. Therefore, during decoding, we use a simple MLP that takes the latent code $z$ as input and passes it through two FCLs (128, 64). The output of those FCLs is used as a shared embedding for two networks: one predicting the full edge tensor, and the second predicting the node feature tensor. The network predicting the edge tensor (including edge types) contains two stacked FCLs (64, 64) and an output layer that returns the upper-triangular entries of the tensor. The node feature network is a single FCL (32) followed by the output layer. For the discriminator, we use a simple MLP that predicts whether the latent code comes from the normal prior distribution $P_Z$. It consists of two stacked FCLs (64, 32) followed by an output layer with sigmoid activation.
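Since the edge network only predicts upper-triangular entries, the decoder must rebuild a full symmetric edge tensor from them. The bookkeeping can be sketched as follows; the function name and shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def edges_from_upper_triangular(flat_logits, n, e):
    """Rebuild a symmetric (n, n, e) edge tensor from a flat
    upper-triangular decoder output of length n*(n-1)//2 * e."""
    iu, ju = np.triu_indices(n, k=1)        # strict upper-triangular pairs
    E = np.zeros((n, n, e))
    E[iu, ju] = flat_logits.reshape(-1, e)  # fill the upper triangle
    E[ju, iu] = E[iu, ju]                   # mirror to enforce symmetry
    return E
```

Predicting only the upper triangle halves the output size and guarantees the generated graph is undirected by construction.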
As in [Kadurin et al., 2017], we do not use batch normalization, since it resulted in a mismatch between the discriminator and the generator.
All models use the same basic generative architecture, the only difference being the presence of a pooling layer and its associated parameters. For DiffPool, we fixed the number of clusters to three, while for LaPool, we use the distance-based regularization for the node-to-cluster mapping and no regularization when computing the node signal.
A.4.2 Reconstruction loss
For each input molecular graph $G = (A, X)$, with $A$ the edge tensor and $X$ the node features, the decoder reconstructs a graph $\hat{G} = (\hat{A}, \hat{X})$. We define the reconstruction loss as:
$$\mathcal{L}_{rec} = \mathcal{L}_{e} + \mathcal{L}_{\bar{e}} + \mathcal{L}_{f} \tag{10}$$
$$\mathcal{L}_{e} = - \sum_{(i,j) \in E} \sum_{k=1}^{e} A_{ijk} \log \hat{A}_{ijk} \tag{11}$$
$$\mathcal{L}_{\bar{e}} = - \sum_{(i,j) \notin E} \log \Big( 1 - \sum_{k=1}^{e} \hat{A}_{ijk} \Big) \tag{12}$$
$$\mathcal{L}_{f} = - \sum_{i=1}^{n} \sum_{d} X_{id} \log \hat{X}_{id} \tag{13}$$
where $\mathcal{L}_{e}$, $\mathcal{L}_{\bar{e}}$, and $\mathcal{L}_{f}$ are respectively the errors for reconstructing the edge types, properly predicting the absence of an edge, and reconstructing the node features.
Since we use a canonical ordering (available in RDKit) to construct $G$ from the SMILES representation of molecules, the decoder is forced to learn how to generate a graph under this order. Therefore, the decoding process is not necessarily able to consider permutations of the vertex set, and the generation of isomorphic graphs will be heavily penalized by the reconstruction loss. In [Simonovsky and Komodakis, 2018], the authors use an expensive graph matching procedure to overcome this limitation. However, it suffices to compute the reconstruction loss on $\phi(G)$ and $\phi(\hat{G})$, where $\phi$ is a permutation-invariant embedding function. Since the Graph Isomorphism Network (GIN) was shown to be invariant to permutation in [Xu et al., 2018], we use an edge-aware GIN layer (see Section A.1) with all weights initialized to 1 to embed both $G$ and $\hat{G}$. The reconstruction loss is then defined as:
$$\mathcal{L} = \mathcal{L}_{rec}\big(\phi(G), \phi(\hat{G})\big) \tag{14}$$
Our experiments show that this loss function produces a higher number of valid molecules, although we speculate that it might prove harder to optimize on datasets with larger graphs.
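To see why a fixed-weight sum-aggregation layer yields a permutation-invariant graph embedding, consider this simplified stand-in for the edge-aware GIN layer: all weights set to 1 and no MLP, so the node update reduces to $(A + I)X$ followed by a sum readout. This is a sketch of the invariance argument, not the actual layer used in the paper.

```python
import numpy as np

def gin_embed(A, X):
    """Graph-level embedding from one sum-aggregation layer with unit
    weights: node update H = (A + I) X, then a permutation-invariant
    sum readout over nodes."""
    H = (A + np.eye(A.shape[0])) @ X
    return H.sum(axis=0)
```

Relabeling the nodes of a graph (conjugating $A$ and permuting the rows of $X$) leaves this embedding unchanged, which is exactly the property needed to compare $G$ and $\hat{G}$ up to isomorphism of the node ordering.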
A.4.3 Training procedure
The QM9 dataset was split into a training (60%), validation (20%) and holdout test (20%) set. Only 25% of the training set is sampled during each epoch, and in all experiments, we use a batch size of 32. The generator network (encoder-decoder) and the discriminator network are trained independently, using the Adam optimizer [Kingma and Ba, 2014] with separate initial learning rates for the generator and the discriminator. During training, we slowly reduce the generator's learning rate by a factor of 0.5 on plateau. To stabilize the learning process and prevent the discriminator from becoming "too good" at distinguishing the true data distribution from the prior, we train the generator twice as often.
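The plateau-based schedule used for the generator's learning rate can be sketched as follows. This is a minimal reimplementation: the halving factor matches the text, while the patience value is an assumption.

```python
class ReduceOnPlateau:
    """Halve the learning rate when the monitored loss stops improving
    for `patience` consecutive epochs (a sketch of the schedule above)."""
    def __init__(self, lr, factor=0.5, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, loss):
        if loss < self.best:
            self.best, self.bad_epochs = loss, 0   # improvement: reset counter
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:   # plateau detected
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

The scheduler is called once per epoch with the validation loss, and the returned rate is fed to the generator's optimizer.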
Appendix B Molecule generation
Here we show how the graph pooling performed by LaPool (WAE-Lap) yields superior generative models compared to DiffPool (WAE-Diff) and no pooling (WAE-GNN). We sample 5000 molecules from each generative model and use the MOSES benchmark [Polykovskiy et al., 2018] to assess the overall quality of the generated molecules by measuring the following additional metrics:

Fragment similarity (Frag) and Scaffold similarity (Scaff): the cosine similarity between vectors of fragment/scaffold frequencies of the generated set and the holdout test set.

Nearest neighbor similarity (SNN): the average similarity of generated molecules to the nearest molecule in the test set.

Internal diversity (IntDiv): the average pairwise similarity of the generated molecules.
Model     Valid  Unique  SNN/Test  Frag/Test  Scaf/Test  IntDiv
WAE-GNN   1.0    1.0     0.3292    0.474      0.4761     0.9203
WAE-Diff  1.0    1.0     0.3002    0.133      0.0531     0.9203
WAE-Lap   1.0    1.0     0.3445    0.6356     0.4393     0.9203
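At their core, the Frag/Scaff metrics reduce to a cosine similarity between frequency vectors. The sketch below illustrates this over abstract fragment identifiers; MOSES computes the actual vectors from molecular fragments and scaffolds, so this is an illustration of the formula only.

```python
import numpy as np
from collections import Counter

def frequency_cosine(frags_a, frags_b):
    """Cosine similarity between the frequency vectors of two fragment
    (or scaffold) multisets, following the Frag/Scaff definition above."""
    keys = sorted(set(frags_a) | set(frags_b))      # shared vocabulary
    ca, cb = Counter(frags_a), Counter(frags_b)
    va = np.array([ca[k] for k in keys], dtype=float)
    vb = np.array([cb[k] for k in keys], dtype=float)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Identical fragment distributions score 1.0, while sets sharing no fragments score 0.0, which matches the intuition that WAE-Lap's higher Frag/Test value indicates generated molecules whose fragment statistics better match the test set.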
Appendix C Signal preservation through Laplacian maxima
We illustrate here, on a 1D signal $f$, how using the Laplacian maxima serves to retain the most prominent regions of the graph signal after smoothing (Figure C.5). We measure the energy conservation of the 1D signal after downsampling to highlight why selecting the Laplacian maxima allows reconstructing the signal with low error compared to selecting the Laplacian minima (which focus on low frequencies). The energy of a discrete signal is defined in (15), and is similar to the energy of a wave in a physical system (without the constants).
$$E(f) = \sum_{n} \big| f[n] \big|^{2} \tag{15}$$
To mimic the molecular graph signal at the pooling stage, the signal is built from an 8-term random Fourier series with added Gaussian noise, then smoothed with two consecutive neighbor-averaging passes. For the pooling methods, a linear interpolation is used to cover the same signal space before computing the energy. We observe that, for the given example, Laplacian-maxima selection minimizes the number of leaders required to preserve the signal and its energy, and significantly outperforms minima selection.
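This experiment can be sketched as follows, replacing the graph Laplacian with its 1D path-graph analogue. The sketch simplifies the setup relative to the text: the noise and Fourier construction are omitted, and the endpoints are always kept so that linear interpolation covers the full signal span.

```python
import numpy as np

def energy(f):
    """Discrete signal energy, as in Eq. (15): E(f) = sum_n |f[n]|^2."""
    return float(np.sum(np.abs(f) ** 2))

def downsample_by_laplacian(f, k, use_maxima=True):
    """Keep the k samples where the magnitude of the 1D (path-graph)
    Laplacian of the signal is largest (maxima) or smallest (minima),
    then linearly interpolate back onto the full grid."""
    lap = np.convolve(f, [-1.0, 2.0, -1.0], mode="same")  # ~ path Laplacian
    order = np.argsort(np.abs(lap))
    keep = order[-k:] if use_maxima else order[:k]
    keep = np.unique(np.concatenate([keep, [0, len(f) - 1]]))  # anchor ends
    x = np.arange(len(f))
    return np.interp(x, x[keep], f[keep])
```

On a signal with one prominent bump, maxima selection keeps the samples around the bump (where the Laplacian magnitude peaks) and reconstructs it with low error, whereas minima selection keeps flat-region samples and loses the bump almost entirely.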