1 Introduction
Convolutional neural networks (CNNs) have brought significant improvements in many areas such as speech recognition, image classification, video understanding, etc. CNN is mainly applied to regular gridstructured data, where convolution operation can be effectively implemented. However, in other applications, e.g., point clouds, molecules, and social networks, data represented as irregular structures or in nonEuclidean domains, the conventional gridbased convolution is hardly applicable to graphs directly. Such data are usually structured as graphs to depict characteristics of entities and relations among them. Due to the requirements of the locality, stationarity, compositionality of data representation, generalization of CNN from gridstructured data to irregular graphs, i.e., graph convolutional neural networks (GCNNs) [Bruna et al.2013, Henaff et al.2015], is highly desired.
In our view, there are two key components in a GCNN model, the convolution kernels capturing information on both vertices and edges in a graph at the current layer, and the aggregation method transferring the information between different layers. Most researches about GCNN focus on the innovations of the former, and generally adopt the conventional pooling strategies, e.g. max pooling and graph coarsening, to implement the information aggregation. However, the pooling (or equivalent) strategies in the existing GCNN models are usually locality insensitive, which are inappropriate for graph structures where all vertices are not created equally. For example, the molecular scaffold is of vital importance in a molecular graph. Consequently, locality insensitive pooling strategies may destroy important graph structure too early when abstracting a graph progressively.
Improving the performance of GCNNs faces many challenges. For instances, to explore an effective aggregation approach for transferring information between the convolution layers, which is different from the ordinary pooling methods, is more difficult than designing the convolution kernels. Most abstracting methods proposed for image processing can hardly achieve a satisfying result for graphs, which lead us to pay more attention to those graphtransformbased approaches. Although some graph algorithms already have been introduced into the deep learning neural networks, e.g. graph coarsening, most of them are not sensitive to the local structures of a graph and could not detect or maintain the corresponding meaningful information. An important problem is how to avoid destroying useful graph structures too early when abstracting a graph.
To address the issues discussed above, in this paper we propose a graph attribute aggregation network (GAAN) architecture for graphbased inference applications. To effectively capture rich attribute information on both vertices and edges, we design a graph attribute convolution (GAC) operation. To maintain important internal structure during progressively abstracting, we propose a progressive margin folding (PMF) operation. Unlike conventionally used pooling operations in CNNs or GCNNs, PMF is locality sensitive and aims to preserve important structures in the interior of a graph. To summarize, we make the following key contributions:

Different from the conventional pooling operations, a graphtransformationbased aggregation strategy, PMF, is proposed for integrating graph features in the convolution layers at different scales.

By dynamically distinguishing the elements into internal and margin ones, we provide a possible approach to implement the folding iteratively.

A mechanism is devised for preserving the possibly meaningful local structures at the appropriate scales during progressively folding.

A hypergraphbased representation method is introduced for transferring the aggregated information between different convolution layers.
With the proposed GAAN model, we implement several graphbased inference applications. In this paper, we provide the experiments about molecular property prediction including classification and regression tasks. The evaluation, conducted on publicly available benchmarks, demonstrates the advantages of the proposed model over the existing GCNNs.
The rest of this paper is organized as follows. Section 2 summaries the related work in the field of GCNNs. Section 3 presents the details of the proposed GAAN architecture, including GAC and PMF. Section 4 reports our experimental results in details. Section 5 concludes this paper.
2 Related Work
Existing GCNN algorithms can be roughly divided into two categories: spectral domain graph convolution and spatial domain graph convolution.
A mathematically sound definition of convolution operator makes use of the spectral graph theory [Chung and Graham1997]. As one of the first methods to generalize the CNNs to graphstructured data, [Bruna et al.2013] designed the convolution operator by a pointwise product in the spectrum domain according to the convolution theorem. Later, the spectrum filtering based methods using Chebyshev polynomials and Cayley polynomials are proposed in [Defferrard et al.2016] and [Levie et al.2017] respectively. [Kipf and Welling2016] simplified the spectrum method [Defferrard et al.2016] with a firstorder approximation to the Chebyshev polynomials as the graph filter spectrum, which requires much less training parameters. [Henaff et al.2015]
developed an extension of the spectral networks to incorporate a graph estimation procedure. By transferring the intrinsic geometric information learned in the source domain,
[Lee et al.2017] constructed a model for a new but related task in the target domain without collecting new data nor training a new model from scratch. [Feng et al.2018] designed a hyper convolution operation is designed to handle the data correlation during representation learning. [Gilmer et al.2017] reformulated some existing models into a single common framework called Message Passing Neural Networks (MPNNs) and explored additional novel variations within this framework. Besides, [Li et al.2018] constructed unique residual Laplacian matrix for graphstructured data and learned distance metric for graph update.On the other hand, some researchers worked on designing feature propagation models in the spatial domain for GCNNs. [Duvenaud et al.2015] proposed a CNN model that operates directly on graphs for learning molecular fingerprints. [Niepert et al.2016] also proposed a framework for learning CNN for arbitrary graphs. Their framework firstly selects a vertex sequence from a graph via a graph labeling procedure. Then, it assembles and normalizes a local neighborhood graph used as receptive fields. The graph maxpooling and graphgathering layers are designed in [AltaeTran et al.2017] for increasing the size of downstream convolutional layer receptive fields without increasing the number of parameters. [Simonovsky and Komodakis2017] formulated a convolutionlike operation on graph signals performed in the spatial domain where filter weights are conditioned on edge labels (discrete or continuous) and dynamically generated for each specific input sample, and they applied the edgeconditioned convolution (ECC) to point cloud classification. PointNet [Qi et al.2017a] and PointNet++ [Qi et al.2017b], which specially work on point cloud, respect well the permutation invariance of points in the input. [Wang et al.2018] used spectral graph convolution on a local graph, combined with a recursive clustering and pooling strategy.
In the existing GCNN models, the pooling (or equivalent) strategies are mainly inspired by conventional CNNs or graph clustering algorithms. Some GCNNs [Kearnes et al.2016, AltaeTran et al.2017] simply return the maximum activation across a receptive filed, which is analogous to the maxpooling operation in CNNs. In addition, by clustering vertices of a graph in multiple layers, [Defferrard et al.2016, Simonovsky and Komodakis2017] utilize a graph coarsening strategy to achieve a multiscale clustering of the graph.
3 Graph Attribute Aggregation Network
We first present the overall GAAN architecture, and then elaborate on GAC and PMF operations.
3.1 GAAN Architecture
Our GAAN architecture, as shown in Figure 1
, consists of the following types of neural network layers: feature optimization layer, GAC layer, batch normalization layer, PMF layer, and global pooling layer.
Firstly, a feature optimization layer implemented with an autoencoder network is adopted to encode raw sparse features of the input graph with arbitrary structures and sizes. Then the features of the graph in different depths are extracted with multiple combos of a GAC layer, a batch normalization layer [Ioffe and Szegedy2015]
, and a PMF layer. In a GAC layer, the graph convolutional kernels are implemented with specific finite attributes for practical applications. Next the information aggregation of the graph is accomplished hierarchically by the PMF layer, and the newly generated graph is transferred to the next layer. Besides, the similar layer combos are added with different parameter configuration in order. Finally, a global pooling layer averages all the feature vectors of the whole graph. The output of this pooling layer is adopted as the graphlevel representation.
3.2 Graph Attribute Convolution
For a graph , where is a finite set of vertices, is a set of edges, the graph features are defined as the following mappings,
where and correspond to features on vertices and edges, respectively;
indicate the layer in a feedforward neural network; and
and are corresponding feature dimensions. The th layer () is the original input with the associated features and .To deal with topologically weak structures and reduce the burden of reliance on fixed kernel design, e.g. vertex degrees, we devise a graph attribute convolution (GAC) operation by integrating attribute information, as summarized in Algorithm 1. Specifically, we select an intrinsic vertex attribute set with discrete and finite values , and similarly, an intrinsic edge attribute set with . Consequently, all vertices in can be classified into finite groups , such that with attribute value . Similarly, edges are categorized by , i.e, . Then the convolutional feature on a vertex is learned by updating the weight and bias corresponding to its category:
(1) 
Similar convolution operation is conducted on an edge :
(2) 
Then GAC is conducted by fusing the learned features:
(3) 
where is a weight coefficient. We adopt the same number of convolutional kernels to learn the features of vertices and edges, thus they have the same number of output channels of corresponding feature map. Afterward, a nonlinear activation function is applied to the convolutional results.
3.3 Progressive Margin Folding
As discussed before, most convolutional layers in the existing GCNNs generally adopt the offtheshelf pooling strategies lacking the sensitivity of local structures, or specifically, they mainly focus on evolving the features on vertices without changing the underlying graph. To address this issue, we devise a graphtransformationbased strategy, PMF, which aims to aggregate iteratively the information between different layers based on the technique of graph abstraction.
First of all, we distinguish the vertices into internal and margin nodes and propose an algorithm for searching marginal structures. Also preparing for the further folding, we detect the necessary and probably meaningful local structures, such as cycles for molecule graphs (detected with RDKit). Secondly, in each iteration of PMF, we fold inward almost all marginal vertices along the graph periphery and reexamine the marginal structures for the next iteration. It should be noted that we take into consideration not only the marginal vertices but also the local structures including judging the time to process them. Moreover, to finish transferring the complete information from a lower layer to an upper layer, we design a hypergraphbased method representing the current folding results.
For an input graph , PMF constructs a pyramid of progressivelyabstracted graphs, denoted by , , where is the layer index, and is the input graph. The main procedure of PMF is shown in Algorithm 2.
3.3.1 Marginal Structure Search
To implement the folding operation, the marginal structures need to be searched first. The marginal elements include marginal vertices (shown as the red nodes in Figure 2 (b)) and marginal edges. In consideration of the meaningful local structures of a graph, such as the cyclic and acyclic structures, some marginal elements may be embedded in a local structure not just dangling as a leaf. Thus different methods were designed for detecting different marginal elements. In brief, for the acyclic part of a graph, a vertex is identified as marginal vertex if the degree of the vertex is 1 and an edge is identified as a marginal edge if the edge connects the marginal vertex. For the part of a ring structure, a vertex is identified as marginal vertex if its degree is 2 and an edge is identified as a marginal edge if the edge connects two marginal vertices of the ring.
It is worth noting that the concept of marginal vertices is a relative concept. The marginal vertices and edges are selected and updated dynamically in each folding iteration. An internal vertex when all its neighbor are folded in the current iteration may become a marginal vertex in the next iteration of the folding procedure. By folding the marginal vertices iteratively, it can be guaranteed to decrease the graph size.
3.3.2 Dynamic Folding Inward
The folding procedure starts with the input graph. After accomplishing the convolutional computation, we sum and transfer the information of marginal vertices into their inner neighbors. For each marginal vertex with associated edge , we transfer the features of and into its neighbor :
(4)  
where the learnable weight parameters and are mainly used to balance the impact of the current vertex and edge on the feature aggregation. Then the layer with index aggregates vertex features and edge features into . For example, in the first step, the embedding of marginal vertex is folded into the marginal edge. In the next step, the embedding of marginal edge by graph attribute convolution is folded into the inner marginal vertex, indicated as red nodes and arrows in Figure 2 (c), (f), and (i). Finally, a new graph is constructed by trimming the marginal vertices and their incident edges in . Graph is progressively delivered to the next layer for further convolution and folding.
During the folding, the local structures are processed simultaneously. Taking the ring shown in the topright of Figure 3 for an example, it depends on a judgment of timing for processing. Although some vertices in this ring are marginal from the beginning, it should wait for the situation that all of them have been folded except one node connecting with the main part of a graph to collapse the ring into a surrogate vertex (Figure 2 (f)). In other words, a ring is collapsed if there is only one branch or no branch that connects to its nodes. As we adopt the detection algorithm for cycles in advance, more refined decisions are integrated for other connection modes of rings. By such a reduction for local structures in a graph, we could not only obtain the abstraction representation at the proper layer but also simplify the large graph efficiently.
It is obviously that the folding algorithm is different from pruning algorithm. The important information of marginal vertices is transferred to inner vertices instead of discarding. As a consequence, the PMF operation can be applied to generalpurpose graphics and can be modified easily for other specific graphs from many other modalities or domains.
3.3.3 Information Transfer with Hierarchical Representation
Each folding layer aggregates graph features and transfers them from a lower layer to an upper layer. Both vertex folding and ring collapsing need a new representation method for transferring the results of aggregation of the current layer to the upper layer. To maintain the integrity of the information and also to reduce the calculation cost, we introduce a hypergraphbased method for the description. When an iteration of folding in PMF is performed, a hypergraph is dynamically constructed for the updated internal vertices, which contains not only the information from the newly folded marginal vertices and edges but also the information of possibly generated surrogate element representing a local structure. In short, we aggregate features of original vertices and edges into the corresponding hypervertices:
(5) 
where denotes the collapsed ring, the weight parameters and are learned during the model training. Then a hypergraph is constructed by inserting hyperedges according to topological relations between these hypervertices.
In the final layer, the global representation of the whole graph is obtained by integrating the features in the different graphs of the pyramid. The folding in each PMF layer is a global operation for the whole graph, thus PMF avoids the risk of overlocalization. Besides, PMF possesses high generality and can be applied to various topology structures of graphs with preserving the completeness of key substructures.
The complexity of PMF is dependent on the number of margin vertices, which is in the worst case where all vertices are marginal except for the center vertex, i.e., a starlike structure of the underlying graph.
4 Experiments
The GAAN model is evaluated in the classification and regression prediction of molecular properties.
4.1 Multitask Classification and Regression
Prediction of molecular properties plays an important role in virtual drug screening, material design, etc. Obtaining accurate properties typically requires high quality feature representation for molecules. To demonstrate the superiority of our GAAN model on molecular property prediction, we compare it with four stateoftheart GCNNs: () the first spectral CNN (SCNN) [Bruna et al.2013]
with linear Bspline interpolated kernel; (
) neural fingerprint (NFP) [Duvenaud et al.2015] which is the cuttingedge neural network for molecules; () the extension to SCNN, the spectral CNN implemented with Chebyshev polynomialsbased spectral filter (ChebNet) [Defferrard et al.2016]; and () the adaptive graph convolutional neural network (AGCN) [Li et al.2018] for learning adaptive graph topology structure.4.1.1 Datasets
Field  Dataset  Task type  Tasks  Graphs 

Physiology  ClinTox  Classification  2  1491 
SIDER  27  1427  
Tox21  12  8014  
ToxCast  617  8615  
Physical Chemistry  ESOL  Regression  1  1128 
Lipophilicity  1  4200  
NCI  60  19126  
FreeSolv  1  643 
Model  ClinTox  SIDER  Tox21  ToxCast  

Validation  Testing  Validation  Testing  Validation  Testing  Validation  Testing  
SCNN  0.7896  0.7069  0.5806  0.5642  0.7105  0.7023  0.6479  0.6496 
NFP  0.7356  0.7469  0.6049  0.5525  0.7502  0.7341  0.6561  0.6384 
ChebNet  0.8303  0.7573  0.6085  0.5914  0.7540  0.7481  0.6914  0.6739 
AGCN  0.9267  0.8678  0.6112  0.5921  0.7947  0.8016  0.7227  0.7033 
GAAN  0.9384  0.8877  0.6650  0.6575  0.8209  0.8387  0.7376  0.7293 
Model  ESOL  Lipophilicity  NCI  FreeSolv 

SCNN  0.4222 8.38e2  0.7516 8.42e3  0.8695 3.55e3  2.0329 2.70e2 
NFP  0.4955 2.30e3  0.9597 5.70e3  0.8748 7.50e3  3.4082 3.95e2 
ChebNet  0.4665 2.07e3  1.0459 3.92e3  0.8717 4.14e3  2.2868 1.37e2 
AGCN  0.3061 5.34e3  0.7362 3.54e3  0.8647 4.67e3  1.3317 2.73e2 
GAAN  0.2936 4.87e3  0.6045 4.10e3  0.8472 2.20e3  1.0568 2.86e2 
Mean and standard deviation of RMSE on ESOL, Lipophilicity, NCI, and FreeSolv datasets.
Multiple public molecular graph datasets [Wu et al.2018] (as summarized in Table 1) in different fields are adopted to evaluate our GAAN model for thoroughly studying the performance. The ClinTox dataset addresses clinical drug toxicity and consists of two classification tasks: clinical trial toxicity and FDA approval status. The side effect resource (SIDER) [Kuhn et al.2015] is a dataset of marketed drugs and adverse drug reactions. The toxicology in the 21st century (Tox21) for measuring the toxicity of compounds was used in the 2014 Tox21 Data Challenge. It contains qualitative toxicity measurements for 8,014 compounds. The ToxCast [Richard et al.2016] dataset provides qualitative toxicology data of more than 600 experiments on 8615 compounds.
ESOL [Delaney2004] is a dataset consisting of water solubility data for 1,128 compounds. Lipophilicity provides experimental results of octanol/water distribution coefficient of 4,200 compounds. The NCI database includes around 20,000 compounds and 60 prediction tasks from drug reaction experiments to clinical pharmacology studies. The free solvation database (FreeSolv) [Mobley and Guthrie2014] provides experimental and calculated hydration free energy of molecules in water. In our experiments, each dataset is split into training, validation and test subsets following the 8:1:1 ratio.
4.1.2 Network Configuration
In the experiments, the vertex attributes contain atom types, valence, formal charge, etc., and the edge attributes include bond types, same ring, etc. The GAAN architecture can be depicted as GAC(32)PMFGAC(64)PMFGAC(128)PMFGAC(256)PMFGMPTanh, where Tanh is the activation function. The GAAN model is trained with multitask cross entropy loss for no more than 200 epochs. When the molecular features are extracted, a multitask classifier is used to predict more than six hundred molecular property tasks. The learning rate starts from 0.001 and the batch size is also 64 empirically. Early stopping strategy is also adopted to obtain the best network model in training phase.
4.1.3 Experimental Results
To evaluate the classification capability of GCNNs, we compare them on four physiology datasets with more than six hundred classification tasks. We calculate the mean AUCROC scores over all of the tasks for each dataset as the metric for indicating the classification precision. According to the experimental results summarized in Table 2, the proposed GAAN model achieves the best performances on all the datasets. The performances of SCNN model have rooms to improve on the Tox21 dataset. The SCNN and NFP models show similar results on the ToxCast dataset, while the ChebNet model is superior to SCNN and NFP on all datasets. In particular, the AGCN model obtains an significant improvement compared to SCNN, NFP and ChebNet on datasets with small molecular graphs such as Tox21, ClinTox and ToxCast. On the SIDER dataset, AGCN achieves a slight boost because SIDER contains larger molecular graphs than other datasets. In short, the experiment results demonstrate the advantage of the GAAN model in classification tasks.
To evaluate the predictive power, we perform the regression tasks to compare the GAAN model with four stateoftheart GCNNs. The Root Mean Square Error (RMSE) is adopted as the metric of regression tasks. The lower value indicates the better performance. Each experiment is repeated ten times to obtain a stable evaluation result and the standard deviation is calculated from statistical data for each dataset. Experimental results on the four physical chemistry datasets (ESOL, Lipophilicity, NCI, and FreeSolv) presented in Table 3 show that AGCN achieves different degrees of improvement compared to SCNN, NFP, and ChebNet. The NFP model receives a higher RMSE than other models on all datasets except Lipophilicity. The ChebNet model is not as good as others on Lipophilicity. By contrast, our GAAN model demonstrates the significant improvement compared to all other methods on all physical chemistry datasets.
Through the analysis of the experimental results, it is obvious that different design principle leads to different ability of extracting information from graphs. The spectral kernel in the SCNN model only connects onehop neighbors. This overlocalized kernel is unable to cover special structures in graphs such as functional groups in molecules. The ChebNet model extends the onehop to the hop kernel and avoids overlocalization compared to SCNN. But the kernel is still applied among graph structures. It is feasible when graphs share common substructures, such as (carboxyl group) in molecules. However, ChebNet does not work well if there are large quantities of substructures classifying graphs into multiple categories, especially when the scale of graphs is small. This may be the reason why the ChebNet performs decently on large graphs such as the mean ROCAUC 0.5914 on SIDER dataset, but dramatically worse on small graphs, e.g., RMSE 0.465 on ESOL and mean ROCAUC 0.7573 on ClinTox. In addition, with the increase of graph scale, ACGN cannot learn more effective information from graphs as the results shown on SIDER. This indicates that the residual laplacian learning for hidden structures is more useful when data is short. As we illustrated above, GAC involves both global topology and local structure information, and more importantly, PMF is adept at aggregating the hierarchical information of graphs. In other words, the proposed PMF with GAC can promote GCCNs to learn features effectively from both small and large graphs.
5 Conclusion
In this paper, we introduce a graph convolutional network architecture, GAAN, in which PMF is devised to implement the aggregation strategy by iteratively abstracting the interior graph structures at different layers. The experiment results show that our strategy, emphasizing the importance of local sensitivity and capturing the interior graph structures naturally, bring us the improvements in the prediction of molecular property. This approach can be expended for other application for many fields, for example, taking the star structures into consideration for the requirements of social networks. Moreover, many aspects of GCNN are worth exploring. The inference operations in GCCN can be divided and parts of them should be conducted in the early stages of graph convolution based on local graph structures. More mature techniques in the graph field into GCCNs, especially the frequent subgraph mining and subgraph isomorphism. New architectures of GCCN, which deeply combine the convolutional kernels and the aggregation methods according to the dynamic hypergraph structures.
References
 [AltaeTran et al.2017] Han AltaeTran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with oneshot learning. ACS Central Science, 3(4):283–293, 2017.
 [Bruna et al.2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 [Chung and Graham1997] Fan RK Chung and Fan Chung Graham. Spectral graph theory. American Mathematical Soc, 1997.
 [Defferrard et al.2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Annual Conference on Neural Information Processing Systems, pages 3844–3852. MIT Press, 2016.
 [Delaney2004] John S Delaney. Esol: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.
 [Duvenaud et al.2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán AspuruGuzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Annual Conference on Neural Information Processing Systems, pages 2224–2232. MIT Press, 2015.
 [Feng et al.2018] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. arXiv preprint arXiv:1809.09401, 2018.
 [Gilmer et al.2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
 [Henaff et al.2015] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graphstructured data. arXiv preprint arXiv:1506.05163, 2015.

[Ioffe and Szegedy2015]
Sergey Ioffe and Christian Szegedy.
Batch normalization: Accelerating deep network training by reducing
internal covariate shift.
In
International Conference on Machine Learning
, pages 448–456. ACM, 2015.  [Kearnes et al.2016] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of ComputerAided Molecular Design, 30(8):595–608, 2016.
 [Kipf and Welling2016] Thomas N Kipf and Max Welling. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 [Kuhn et al.2015] Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. The sider database of drugs and side effects. Nucleic Acids Research, 44(D1):D1075–D1079, 2015.

[Lee et al.2017]
Jaekoo Lee, Hyunjae Kim, Jongsun Lee, and Sungroh Yoon.
Transfer learning for deep learning on graphstructured data.
In
AAAI Conference on Artificial Intelligence
, pages 2154–2160. AAAI, 2017.  [Levie et al.2017] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664, 2017.
 [Li et al.2018] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In AAAI Conference on Artificial Intelligence, pages 3546–3553. AAAI, 2018.
 [Mobley and Guthrie2014] David L Mobley and J Peter Guthrie. Freesolv: A database of experimental and calculated hydration free energies, with input files. Journal of ComputerAided Molecular Design, 28(7):711–720, 2014.
 [Niepert et al.2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023. ACM, 2016.

[Qi et al.2017a]
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas.
Pointnet: Deep learning on point sets for 3d classification and
segmentation.
In
Conference on Computer Vision and Pattern Recognition
, pages 652–660. IEEE, 2017.  [Qi et al.2017b] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Annual Conference on Neural Information Processing Systems, pages 5099–5108. MIT Press, 2017.
 [Richard et al.2016] Ann M Richard, Richard S Judson, Keith A Houck, Christopher M Grulke, Patra Volarath, Inthirany Thillainadarajah, Chihae Yang, James Rathman, Matthew T Martin, John F Wambaugh, et al. Toxcast chemical landscape: Paving the road to 21st century toxicology. Chemical Research in Toxicology, 29(8):1225–1251, 2016.
 [Simonovsky and Komodakis2017] Martin Simonovsky and Nikos Komodakis. Dynamic edgeconditioned filters in convolutional neural networks on graphs. In Conference on Computer Vision and Pattern Recognition, pages 3693–3702. IEEE, 2017.
 [Wang et al.2018] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. arXiv preprint arXiv:1803.05827, 2018.
 [Wu et al.2018] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.