Graph Attribute Aggregation Network with Progressive Margin Folding

05/14/2019 ∙ by Penghui Sun, et al. ∙ Temple University ∙ Peking University

Graph convolutional neural networks (GCNNs) have been attracting increasing research attention due to their great potential for inference over graph structures. However, insufficient effort has been devoted to the aggregation methods between different graph convolution layers. In this paper, we introduce a graph attribute aggregation network (GAAN) architecture. Different from conventional pooling operations, a graph-transformation-based aggregation strategy, progressive margin folding (PMF), is proposed for integrating graph features. By distinguishing internal and margin elements, we provide an approach for implementing the folding iteratively, and a mechanism is also devised for preserving local structures during progressive folding. In addition, a hypergraph-based representation is introduced for transferring the aggregated information between different layers. Our experiments on public molecule datasets demonstrate that the proposed GAAN significantly outperforms existing GCNN models.

1 Introduction

Convolutional neural networks (CNNs) have brought significant improvements to many areas such as speech recognition, image classification, and video understanding. CNNs are mainly applied to regular grid-structured data, where the convolution operation can be implemented efficiently. However, in other applications, e.g., point clouds, molecules, and social networks, data are represented as irregular structures or in non-Euclidean domains, and the conventional grid-based convolution is hardly applicable directly. Such data are usually structured as graphs to depict the characteristics of entities and the relations among them. Due to the requirements of locality, stationarity, and compositionality in data representation, generalizing CNNs from grid-structured data to irregular graphs, i.e., graph convolutional neural networks (GCNNs) [Bruna et al.2013, Henaff et al.2015], is highly desired.

In our view, there are two key components in a GCNN model: the convolution kernels, which capture information on both vertices and edges of a graph at the current layer, and the aggregation method, which transfers information between layers. Most research on GCNNs focuses on innovations in the former and generally adopts conventional pooling strategies, e.g., max pooling and graph coarsening, to implement information aggregation. However, the pooling (or equivalent) strategies in existing GCNN models are usually locality insensitive, which is inappropriate for graph structures where not all vertices are created equal. For example, the molecular scaffold is of vital importance in a molecular graph. Consequently, locality-insensitive pooling strategies may destroy important graph structures too early when abstracting a graph progressively.

Improving the performance of GCNNs faces many challenges. For instance, exploring an effective aggregation approach for transferring information between convolution layers, different from ordinary pooling methods, is more difficult than designing the convolution kernels. Most abstraction methods proposed for image processing can hardly achieve satisfying results on graphs, which leads us to pay more attention to graph-transformation-based approaches. Although some graph algorithms have already been introduced into deep neural networks, e.g., graph coarsening, most of them are not sensitive to the local structures of a graph and cannot detect or maintain the corresponding meaningful information. An important problem is how to avoid destroying useful graph structures too early when abstracting a graph.

To address the issues discussed above, in this paper we propose a graph attribute aggregation network (GAAN) architecture for graph-based inference applications. To effectively capture rich attribute information on both vertices and edges, we design a graph attribute convolution (GAC) operation. To maintain important internal structures during progressive abstraction, we propose a progressive margin folding (PMF) operation. Unlike the pooling operations conventionally used in CNNs or GCNNs, PMF is locality sensitive and aims to preserve important structures in the interior of a graph. To summarize, we make the following key contributions:

  • Different from the conventional pooling operations, a graph-transformation-based aggregation strategy, PMF, is proposed for integrating graph features in the convolution layers at different scales.

  • By dynamically distinguishing the elements into internal and margin ones, we provide a possible approach to implement the folding iteratively.

  • A mechanism is devised for preserving potentially meaningful local structures at the appropriate scales during progressive folding.

  • A hypergraph-based representation method is introduced for transferring the aggregated information between different convolution layers.

With the proposed GAAN model, we implement several graph-based inference applications. In this paper, we present experiments on molecular property prediction, including classification and regression tasks. The evaluation, conducted on publicly available benchmarks, demonstrates the advantages of the proposed model over existing GCNNs.

The rest of this paper is organized as follows. Section 2 summarizes the related work in the field of GCNNs. Section 3 presents the details of the proposed GAAN architecture, including GAC and PMF. Section 4 reports our experimental results in detail. Section 5 concludes this paper.

2 Related Work

Existing GCNN algorithms can be roughly divided into two categories: spectral domain graph convolution and spatial domain graph convolution.

A mathematically sound definition of the convolution operator makes use of spectral graph theory [Chung and Graham1997]. As one of the first methods to generalize CNNs to graph-structured data, [Bruna et al.2013] designed the convolution operator as a point-wise product in the spectral domain according to the convolution theorem. Later, spectrum-filtering-based methods using Chebyshev polynomials and Cayley polynomials were proposed in [Defferrard et al.2016] and [Levie et al.2017], respectively. [Kipf and Welling2016] simplified the spectral method of [Defferrard et al.2016] with a first-order approximation to the Chebyshev polynomials as the graph filter spectrum, which requires far fewer training parameters. [Henaff et al.2015] developed an extension of spectral networks to incorporate a graph estimation procedure. By transferring the intrinsic geometric information learned in the source domain, [Lee et al.2017] constructed a model for a new but related task in the target domain without collecting new data or training a new model from scratch. [Feng et al.2018] designed a hypergraph convolution operation to handle data correlation during representation learning. [Gilmer et al.2017] reformulated several existing models into a single common framework called Message Passing Neural Networks (MPNNs) and explored novel variations within this framework. Besides, [Li et al.2018] constructed a unique residual Laplacian matrix for graph-structured data and learned a distance metric for graph updates.

On the other hand, some researchers have worked on designing feature propagation models in the spatial domain for GCNNs. [Duvenaud et al.2015] proposed a CNN model that operates directly on graphs for learning molecular fingerprints. [Niepert et al.2016] also proposed a framework for learning CNNs on arbitrary graphs. Their framework first selects a vertex sequence from a graph via a graph labeling procedure, then assembles and normalizes local neighborhood graphs used as receptive fields. Graph max-pooling and graph-gathering layers are designed in [Altae-Tran et al.2017] for increasing the receptive field size of downstream convolutional layers without increasing the number of parameters. [Simonovsky and Komodakis2017] formulated a convolution-like operation on graph signals performed in the spatial domain, where filter weights are conditioned on edge labels (discrete or continuous) and dynamically generated for each specific input sample, and applied this edge-conditioned convolution (ECC) to point cloud classification. PointNet [Qi et al.2017a] and PointNet++ [Qi et al.2017b], which work specifically on point clouds, respect the permutation invariance of the input points. [Wang et al.2018] used spectral graph convolution on local graphs, combined with a recursive clustering and pooling strategy.

In the existing GCNN models, the pooling (or equivalent) strategies are mainly inspired by conventional CNNs or graph clustering algorithms. Some GCNNs [Kearnes et al.2016, Altae-Tran et al.2017] simply return the maximum activation across a receptive field, which is analogous to the max-pooling operation in CNNs. In addition, by clustering the vertices of a graph in multiple layers, [Defferrard et al.2016, Simonovsky and Komodakis2017] utilize a graph coarsening strategy to achieve a multi-scale clustering of the graph.

3 Graph Attribute Aggregation Network

We first present the overall GAAN architecture, and then elaborate on GAC and PMF operations.

3.1 GAAN Architecture

Our GAAN architecture, as shown in Figure 1, consists of the following types of neural network layers: feature optimization layer, GAC layer, batch normalization layer, PMF layer, and global pooling layer.

Firstly, a feature optimization layer implemented with an auto-encoder network is adopted to encode the raw sparse features of the input graph, which may have arbitrary structure and size. Then the features of the graph at different depths are extracted with multiple combos of a GAC layer, a batch normalization layer [Ioffe and Szegedy2015], and a PMF layer. In a GAC layer, the graph convolutional kernels are implemented with specific finite attributes for practical applications. Next, the information aggregation of the graph is accomplished hierarchically by the PMF layer, and the newly generated graph is transferred to the next layer. Similar layer combos are then stacked in order with different parameter configurations. Finally, a global pooling layer averages all the feature vectors of the whole graph. The output of this pooling layer is adopted as the graph-level representation.
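To make this pipeline concrete, the following minimal PyTorch sketch wires the layers together. GACLayer and PMFLayer here are simplified placeholders standing in for the operations detailed in Sections 3.2 and 3.3 (the real PMF layer also folds marginal vertices and shrinks the graph); only the combo ordering and the global average pooling follow the description above.

```python
import torch
import torch.nn as nn

class GACLayer(nn.Module):
    """Simplified stand-in for graph attribute convolution (Sec. 3.2)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, h_v):
        return torch.relu(self.lin(h_v))

class PMFLayer(nn.Module):
    """Simplified stand-in for progressive margin folding (Sec. 3.3).
    The real layer also folds marginal vertices and shrinks the graph;
    here it just passes vertex features through."""
    def forward(self, h_v):
        return h_v

class GAAN(nn.Module):
    def __init__(self, in_dim, dims=(32, 64, 128, 256)):
        super().__init__()
        # Feature optimization layer (auto-encoder stand-in).
        self.feature_opt = nn.Linear(in_dim, dims[0])
        layers, prev = [], dims[0]
        for d in dims:  # GAC-BN-PMF combos at increasing widths
            layers += [GACLayer(prev, d), nn.BatchNorm1d(d), PMFLayer()]
            prev = d
        self.body = nn.Sequential(*layers)

    def forward(self, h_v):
        # h_v: (num_vertices, in_dim) vertex feature matrix of one graph
        h = self.feature_opt(h_v)
        h = self.body(h)
        return h.mean(dim=0)  # global pooling -> graph-level representation

# Usage: GAAN(in_dim=16)(torch.randn(10, 16)) yields a 256-d graph embedding.
```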

3.2 Graph Attribute Convolution

For a graph $G = (V, E)$, where $V$ is a finite set of vertices and $E$ is a set of edges, the graph features are defined as the following mappings,

$$h_v^{(l)}: V \rightarrow \mathbb{R}^{d_v^{(l)}}, \qquad h_e^{(l)}: E \rightarrow \mathbb{R}^{d_e^{(l)}},$$

where $h_v^{(l)}$ and $h_e^{(l)}$ correspond to features on vertices and edges, respectively; $l = 0, 1, \ldots, L$ indicates the layer in a feed-forward neural network; and $d_v^{(l)}$ and $d_e^{(l)}$ are the corresponding feature dimensions. The $0$-th layer ($l = 0$) is the original input with the associated features $h_v^{(0)}$ and $h_e^{(0)}$.

Input: Graph batch {G = (V, E)} with features h_v, h_e
       Weights {W_m}, {W_n}; biases {b_m}, {b_n} per attribute category
Output: Feature graphs
for each graph G in the batch do
      a_v, a_e ← GetAttributeValue(V, E)
      {V_1, ..., V_M}, {E_1, ..., E_N} ← Classify(V, E, a_v, a_e)
      for each vertex group V_m do
            h_v ← W_m h_v + b_m for all v ∈ V_m        // Eq. (1)
      for each edge group E_n do
            h_e ← W_n h_e + b_n for all e ∈ E_n        // Eq. (2)
      fuse the learned vertex and edge features        // Eq. (3)
      h ← σ(h)                                         // activation function
return feature graphs
Algorithm 1 Graph Attribute Convolution (GAC)

To deal with topologically weak structures and reduce the burden of reliance on fixed kernel designs, e.g., vertex degrees, we devise a graph attribute convolution (GAC) operation by integrating attribute information, as summarized in Algorithm 1. Specifically, we select an intrinsic vertex attribute with discrete and finite values $\{a_1, \ldots, a_M\}$ and, similarly, an intrinsic edge attribute with values $\{b_1, \ldots, b_N\}$. Consequently, all vertices in $V$ can be classified into finite groups $V_1, \ldots, V_M$, such that every vertex in $V_m$ has attribute value $a_m$. Similarly, edges are categorized by attribute value, i.e., $E = E_1 \cup \cdots \cup E_N$. Then the convolutional feature on a vertex $v \in V_m$ is learned by updating the weight and bias corresponding to its category:

$$h_v^{(l+1)} = W_m^{(l)} h_v^{(l)} + b_m^{(l)}, \quad v \in V_m. \qquad (1)$$

A similar convolution operation is conducted on an edge $e \in E_n$:

$$h_e^{(l+1)} = W_n^{(l)} h_e^{(l)} + b_n^{(l)}, \quad e \in E_n. \qquad (2)$$

Then GAC is conducted by fusing the learned features:

$$\tilde{h}_v^{(l+1)} = h_v^{(l+1)} + \lambda \sum_{e \in \mathcal{N}(v)} h_e^{(l+1)}, \qquad (3)$$

where $\lambda$ is a weight coefficient and $\mathcal{N}(v)$ denotes the set of edges incident to $v$. We adopt the same number of convolutional kernels to learn the features of vertices and edges, so the corresponding feature maps have the same number of output channels and can be fused directly. Afterward, a non-linear activation function is applied to the convolutional results.
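A hedged PyTorch sketch of Eqs. (1)-(3) follows, using the reconstructed notation above. The per-category linear layers, the shared input dimension for vertex and edge features, and the exact fusion of edge features into their endpoint vertices are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GAC(nn.Module):
    """Sketch of graph attribute convolution (Eqs. 1-3)."""
    def __init__(self, d_in, d_out, n_vertex_cats, n_edge_cats, lam=0.5):
        super().__init__()
        # One weight/bias pair per vertex category (Eq. 1) and per
        # edge category (Eq. 2), realized as nn.Linear modules.
        self.w_v = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_vertex_cats))
        self.w_e = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_edge_cats))
        self.lam = lam  # fusion weight coefficient in Eq. (3)

    def forward(self, h_v, v_cat, h_e, e_cat, edges):
        # h_v: (V, d_in), v_cat: list of V category indices
        # h_e: (E, d_in), e_cat: list of E category indices
        # edges: list of (i, j) endpoint index pairs
        out_v = torch.stack([self.w_v[int(c)](h_v[i]) for i, c in enumerate(v_cat)])
        out_e = torch.stack([self.w_e[int(c)](h_e[i]) for i, c in enumerate(e_cat)])
        fused = out_v.clone()
        for k, (i, j) in enumerate(edges):
            # Eq. (3): add each edge's feature into both of its endpoints.
            fused[i] = fused[i] + self.lam * out_e[k]
            fused[j] = fused[j] + self.lam * out_e[k]
        return torch.tanh(fused), torch.tanh(out_e)  # non-linear activation
```

For example, GAC(d_in=8, d_out=16, n_vertex_cats=4, n_edge_cats=3) maps 8-dimensional vertex and edge features to 16 output channels with a separate kernel per attribute category.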

3.3 Progressive Margin Folding

As discussed before, most convolutional layers in existing GCNNs generally adopt off-the-shelf pooling strategies that lack sensitivity to local structures; or, more specifically, they mainly focus on evolving the features on vertices without changing the underlying graph. To address this issue, we devise a graph-transformation-based strategy, PMF, which aims to iteratively aggregate information between different layers based on graph abstraction.

First of all, we distinguish vertices into internal and margin nodes and propose an algorithm for searching marginal structures. Also, in preparation for further folding, we detect necessary and potentially meaningful local structures, such as cycles in molecule graphs (detected with RDKit). Secondly, in each iteration of PMF, we fold inward almost all marginal vertices along the graph periphery and re-examine the marginal structures for the next iteration. It should be noted that we take into consideration not only the marginal vertices but also the local structures, including judging when to process them. Moreover, to transfer the complete information from a lower layer to an upper layer, we design a hypergraph-based method to represent the current folding results.

For an input graph $G$, PMF constructs a pyramid of progressively abstracted graphs, denoted by $G^{(l)}$, $l = 0, 1, \ldots, L$, where $l$ is the layer index and $G^{(0)} = G$ is the input graph. The main procedure of PMF is shown in Algorithm 2.

Input: Feature graph batch {G^(l)}
Output: Abstracted graphs {G^(l+1)}
for each graph G^(l) in the batch do
      M ← GetMarginalStructure(G^(l))
      for each marginal vertex v ∈ M do
            u ← GetNeighbor(v)
            h_u ← h_u + α h_v + β h_(v,u)              // Eq. (4)
            trim v and its incident edge
      if at most one branch connects to a ring then
            collapse the ring into a surrogate vertex   // ring collapsing
      generate hypergraph G^(l+1)                       // hypergraph generating
return abstracted graphs
Algorithm 2 Progressive Margin Folding (PMF)

3.3.1 Marginal Structure Search

To implement the folding operation, the marginal structures need to be searched first. The marginal elements include marginal vertices (shown as the red nodes in Figure 2 (b)) and marginal edges. In consideration of the meaningful local structures of a graph, such as cyclic and acyclic structures, some marginal elements may be embedded in a local structure rather than just dangling as a leaf. Thus, different methods are designed for detecting different marginal elements. In brief, for the acyclic part of a graph, a vertex is identified as a marginal vertex if its degree is 1, and an edge is identified as a marginal edge if it connects to a marginal vertex. For a ring structure, a vertex is identified as a marginal vertex if its degree is 2, and an edge is identified as a marginal edge if it connects two marginal vertices of the ring.
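As a concrete reading of these rules, the following sketch finds marginal vertices and edges with networkx. The function name and the precomputed ring_nodes set (e.g., from nx.cycle_basis or RDKit's ring info) are assumptions for illustration.

```python
import networkx as nx

def find_marginal(G: nx.Graph, ring_nodes: set):
    """Return the marginal vertices and edges of G under the rules above."""
    marginal_v, marginal_e = set(), set()
    for v in G.nodes:
        if v not in ring_nodes and G.degree(v) == 1:
            marginal_v.add(v)            # acyclic part: dangling leaf
        elif v in ring_nodes and G.degree(v) == 2:
            marginal_v.add(v)            # ring part: degree-2 ring vertex
    for u, v in G.edges:
        on_ring = u in ring_nodes and v in ring_nodes
        if on_ring and u in marginal_v and v in marginal_v:
            marginal_e.add((u, v))       # ring edge between two marginal vertices
        elif not on_ring and (u in marginal_v or v in marginal_v):
            marginal_e.add((u, v))       # acyclic edge touching a marginal leaf
    return marginal_v, marginal_e

# Usage: ring_nodes could come from cycle detection, e.g.
# ring_nodes = {v for cyc in nx.cycle_basis(G) for v in cyc}
```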

It is worth noting that marginality is a relative concept: the marginal vertices and edges are selected and updated dynamically in each folding iteration. An internal vertex may become a marginal vertex in the next iteration once all of its marginal neighbors have been folded in the current one. By folding the marginal vertices iteratively, the graph size is guaranteed to decrease.

Figure 2: Sample steps of PMF. (a) Original graph $G^{(0)}$; (b) marginal structure of $G^{(0)}$; (c) folding $G^{(0)}$; (d) $G^{(1)}$; (e) marginal structure of $G^{(1)}$; (f) folding $G^{(1)}$; (g) $G^{(2)}$; (h) marginal structure of $G^{(2)}$; (i) folding $G^{(2)}$.

3.3.2 Dynamic Folding Inward

The folding procedure starts with the input graph. After accomplishing the convolutional computation, we sum and transfer the information of marginal vertices into their inner neighbors. For each marginal vertex $v$ with associated edge $e = (v, u)$, we transfer the features of $v$ and $e$ into its neighbor $u$:

$$h_u^{(l)} \leftarrow h_u^{(l)} + \alpha\, h_v^{(l)} + \beta\, h_e^{(l)}, \qquad (4)$$

where the learnable weight parameters $\alpha$ and $\beta$ are mainly used to balance the impact of the current vertex and edge on the feature aggregation. The layer with index $l$ thus aggregates vertex and edge features into $G^{(l+1)}$. For example, in the first step, the embedding of a marginal vertex is folded into its marginal edge. In the next step, the embedding of the marginal edge produced by graph attribute convolution is folded into the inner marginal vertex, indicated as red nodes and arrows in Figure 2 (c), (f), and (i). Finally, a new graph $G^{(l+1)}$ is constructed by trimming the marginal vertices and their incident edges in $G^{(l)}$. Graph $G^{(l+1)}$ is progressively delivered to the next layer for further convolution and folding.
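A minimal sketch of one folding iteration under Eq. (4) might look as follows, assuming features are stored as node/edge attributes of a networkx graph (as numpy arrays) and that alpha and beta stand in for the learnable weights.

```python
import networkx as nx

def fold_once(G: nx.Graph, marginal_v, alpha=0.5, beta=0.5):
    """Fold dangling marginal vertices into their inner neighbors (Eq. 4)."""
    H = G.copy()
    for v in marginal_v:
        neighbors = list(G.neighbors(v))
        if len(neighbors) != 1:
            continue                     # only dangling leaves are folded here
        u = neighbors[0]
        # Eq. (4): transfer the features of v and edge (v, u) into u.
        H.nodes[u]["h"] = (H.nodes[u]["h"]
                           + alpha * G.nodes[v]["h"]
                           + beta * G.edges[v, u]["h"])
        H.remove_node(v)                 # trim v and its incident edge
    return H
```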

During folding, the local structures are processed simultaneously. Taking the ring shown in the top-right of Figure 3 as an example, collapsing it into a surrogate vertex depends on a judgment of timing: although some vertices in this ring are marginal from the beginning, the ring is not collapsed until all of its vertices have been folded except the one connecting to the main part of the graph (Figure 2 (f)). In other words, a ring is collapsed when at most one branch connects to its nodes. As we run the cycle detection algorithm in advance, more refined decisions are integrated for other connection modes of rings. Through such a reduction of local structures, we can not only obtain the abstract representation at the proper layer but also simplify large graphs efficiently.
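The collapse-timing rule can be read as a simple branch count, sketched below; ring is assumed to be the vertex set of a detected cycle.

```python
import networkx as nx

def ring_ready_to_collapse(G: nx.Graph, ring: set) -> bool:
    """A ring collapses only when at most one branch still connects to it."""
    branches = sum(1 for v in ring
                   for u in G.neighbors(v) if u not in ring)
    return branches <= 1
```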

Obviously, the folding algorithm is different from a pruning algorithm: the important information of marginal vertices is transferred to inner vertices instead of being discarded. As a consequence, the PMF operation can be applied to general-purpose graphs and can be easily modified for specific graphs from many other modalities or domains.

3.3.3 Information Transfer with Hierarchical Representation

Each folding layer aggregates graph features and transfers them from a lower layer to an upper layer. Both vertex folding and ring collapsing need a new representation for transferring the aggregation results of the current layer to the upper layer. To maintain the integrity of the information and also to reduce the calculation cost, we introduce a hypergraph-based representation. When an iteration of folding in PMF is performed, a hypergraph is dynamically constructed for the updated internal vertices, which contains not only the information from the newly folded marginal vertices and edges but also the information of any surrogate elements representing local structures. In short, we aggregate the features of original vertices and edges into the corresponding hypervertices:

$$h_{\hat{v}}^{(l+1)} = \gamma \sum_{v \in R} h_v^{(l)} + \delta \sum_{e \in R} h_e^{(l)}, \qquad (5)$$

where $R$ denotes the collapsed ring, and the weight parameters $\gamma$ and $\delta$ are learned during model training. Then a hypergraph is constructed by inserting hyperedges according to the topological relations between these hypervertices.
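A sketch of the hypervertex aggregation in Eq. (5), with gamma and delta standing in for the learned weights and dictionary-keyed features as an illustrative data layout:

```python
import numpy as np

def collapse_ring_features(h_v, h_e, ring_vertices, ring_edges,
                           gamma=0.5, delta=0.5):
    """Sum the features of a collapsed ring into one hypervertex (Eq. 5).

    h_v: dict vertex -> feature vector; h_e: dict edge -> feature vector.
    """
    return (gamma * np.sum([h_v[v] for v in ring_vertices], axis=0)
            + delta * np.sum([h_e[e] for e in ring_edges], axis=0))
```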

In the final layer, the global representation of the whole graph is obtained by integrating the features from the different graphs of the pyramid. The folding in each PMF layer is a global operation over the whole graph, so PMF avoids the risk of over-localization. Besides, PMF possesses high generality and can be applied to graphs with various topological structures while preserving the completeness of key substructures.

The complexity of PMF depends on the number of margin vertices, which is $O(|V|)$ in the worst case, where all vertices except a center vertex are marginal, i.e., a star-like structure of the underlying graph.

4 Experiments

The GAAN model is evaluated on classification and regression tasks for predicting molecular properties.

4.1 Multi-task Classification and Regression

Prediction of molecular properties plays an important role in virtual drug screening, material design, etc. Obtaining accurate properties typically requires high quality feature representations for molecules. To demonstrate the superiority of our GAAN model on molecular property prediction, we compare it with four state-of-the-art GCNNs: (i) the first spectral CNN (SCNN) [Bruna et al.2013] with linear B-spline interpolated kernel; (ii) neural fingerprint (NFP) [Duvenaud et al.2015], a cutting-edge neural network for molecules; (iii) the extension to SCNN, the spectral CNN implemented with a Chebyshev polynomial-based spectral filter (ChebNet) [Defferrard et al.2016]; and (iv) the adaptive graph convolutional neural network (AGCN) [Li et al.2018] for learning adaptive graph topology structures.

4.1.1 Datasets

Field                Dataset        Task type       Tasks  Graphs
Physiology           ClinTox        Classification      2    1491
                     SIDER          Classification     27    1427
                     Tox21          Classification     12    8014
                     ToxCast        Classification    617    8615
Physical Chemistry   ESOL           Regression          1    1128
                     Lipophilicity  Regression          1    4200
                     NCI            Regression         60   19126
                     FreeSolv       Regression          1     643
Table 1: Dataset details.
Model     ClinTox               SIDER                 Tox21                 ToxCast
          Validation  Testing   Validation  Testing   Validation  Testing   Validation  Testing
SCNN      0.7896      0.7069    0.5806      0.5642    0.7105      0.7023    0.6479      0.6496
NFP       0.7356      0.7469    0.6049      0.5525    0.7502      0.7341    0.6561      0.6384
ChebNet   0.8303      0.7573    0.6085      0.5914    0.7540      0.7481    0.6914      0.6739
AGCN      0.9267      0.8678    0.6112      0.5921    0.7947      0.8016    0.7227      0.7033
GAAN      0.9384      0.8877    0.6650      0.6575    0.8209      0.8387    0.7376      0.7293
Table 2: Mean ROC-AUC scores on ClinTox, SIDER, Tox21, and ToxCast datasets.
Model     ESOL               Lipophilicity      NCI                FreeSolv
SCNN      0.4222 ± 8.38e-2   0.7516 ± 8.42e-3   0.8695 ± 3.55e-3   2.0329 ± 2.70e-2
NFP       0.4955 ± 2.30e-3   0.9597 ± 5.70e-3   0.8748 ± 7.50e-3   3.4082 ± 3.95e-2
ChebNet   0.4665 ± 2.07e-3   1.0459 ± 3.92e-3   0.8717 ± 4.14e-3   2.2868 ± 1.37e-2
AGCN      0.3061 ± 5.34e-3   0.7362 ± 3.54e-3   0.8647 ± 4.67e-3   1.3317 ± 2.73e-2
GAAN      0.2936 ± 4.87e-3   0.6045 ± 4.10e-3   0.8472 ± 2.20e-3   1.0568 ± 2.86e-2
Table 3: Mean and standard deviation of RMSE on ESOL, Lipophilicity, NCI, and FreeSolv datasets.

Multiple public molecular graph datasets [Wu et al.2018] (summarized in Table 1) from different fields are adopted to evaluate our GAAN model and thoroughly study its performance. The ClinTox dataset addresses clinical drug toxicity and consists of two classification tasks: clinical trial toxicity and FDA approval status. The Side Effect Resource (SIDER) [Kuhn et al.2015] is a dataset of marketed drugs and adverse drug reactions. The Toxicology in the 21st Century (Tox21) dataset, which measures the toxicity of compounds, was used in the 2014 Tox21 Data Challenge; it contains qualitative toxicity measurements for 8,014 compounds. The ToxCast [Richard et al.2016] dataset provides qualitative toxicology data from more than 600 experiments on 8,615 compounds.

ESOL [Delaney2004] is a dataset consisting of water solubility data for 1,128 compounds. Lipophilicity provides experimental results of the octanol/water distribution coefficient for 4,200 compounds. The NCI database includes around 20,000 compounds and 60 prediction tasks, from drug reaction experiments to clinical pharmacology studies. The Free Solvation Database (FreeSolv) [Mobley and Guthrie2014] provides experimental and calculated hydration free energies of molecules in water. In our experiments, each dataset is split into training, validation, and test subsets following an 8:1:1 ratio.
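A minimal sketch of such a split follows; the paper does not specify the splitting strategy, so the random shuffle and fixed seed below are assumptions.

```python
import random

def split_811(items, seed=0):
    """Shuffle and split a dataset into 8:1:1 train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```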

4.1.2 Network Configuration

In the experiments, the vertex attributes include atom type, valence, formal charge, etc., and the edge attributes include bond type, same-ring membership, etc. The GAAN architecture can be depicted as GAC(32)-PMF-GAC(64)-PMF-GAC(128)-PMF-GAC(256)-PMF-GMP-Tanh, where Tanh is the activation function. The GAAN model is trained with a multitask cross entropy loss for no more than 200 epochs. After the molecular features are extracted, a multitask classifier is used to predict more than six hundred molecular property tasks. The learning rate starts from 0.001 and the batch size is set to 64 empirically. An early stopping strategy is also adopted to obtain the best network model during training.
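Putting the stated hyperparameters together, a training-loop sketch could look like the following; GAAN refers to the pipeline sketch in Section 3.1, and train_one_epoch, evaluate, and the patience value are hypothetical placeholders.

```python
import torch

model = GAAN(in_dim=64)  # hypothetical input feature dimension
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.BCEWithLogitsLoss()  # multitask cross entropy
best_val, patience, bad_epochs = float("inf"), 10, 0  # patience assumed

for epoch in range(200):  # trained for no more than 200 epochs
    train_one_epoch(model, optimizer, criterion, batch_size=64)  # placeholder
    val_loss = evaluate(model)  # hypothetical validation helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_gaan.pt")  # keep best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
```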

4.1.3 Experimental Results

To evaluate the classification capability of the GCNNs, we compare them on four physiology datasets with more than six hundred classification tasks in total. We calculate the mean ROC-AUC score over all tasks of each dataset as the metric of classification precision. According to the experimental results summarized in Table 2, the proposed GAAN model achieves the best performance on all datasets. The performance of the SCNN model has room for improvement on the Tox21 dataset. The SCNN and NFP models show similar results on the ToxCast dataset, while the ChebNet model is superior to SCNN and NFP on all datasets. In particular, the AGCN model obtains a significant improvement over SCNN, NFP, and ChebNet on datasets with small molecular graphs such as Tox21, ClinTox, and ToxCast. On the SIDER dataset, AGCN achieves only a slight boost because SIDER contains larger molecular graphs than the other datasets. In short, the experimental results demonstrate the advantage of the GAAN model in classification tasks.

To evaluate the predictive power, we perform regression tasks to compare the GAAN model with the four state-of-the-art GCNNs. The root mean square error (RMSE) is adopted as the metric for regression tasks; a lower value indicates better performance. Each experiment is repeated ten times to obtain a stable evaluation, and the standard deviation is calculated for each dataset. Experimental results on the four physical chemistry datasets (ESOL, Lipophilicity, NCI, and FreeSolv) presented in Table 3 show that AGCN achieves varying degrees of improvement over SCNN, NFP, and ChebNet. The NFP model yields a higher RMSE than the other models on all datasets except Lipophilicity, while the ChebNet model is not as good as the others on Lipophilicity. By contrast, our GAAN model demonstrates significant improvements over all other methods on all physical chemistry datasets.

The analysis of the experimental results makes it obvious that different design principles lead to different abilities to extract information from graphs. The spectral kernel in the SCNN model only connects one-hop neighbors. This over-localized kernel is unable to cover special structures in graphs, such as functional groups in molecules. The ChebNet model extends the one-hop kernel to a $K$-hop kernel and avoids the over-localization of SCNN, but the kernel is still shared across graph structures. This is feasible when graphs share common sub-structures, such as -COOH (carboxyl groups) in molecules. However, ChebNet does not work well if large quantities of sub-structures classify graphs into multiple categories, especially when the scale of the graphs is small. This may be the reason why ChebNet performs decently on large graphs, e.g., a mean ROC-AUC of 0.5914 on the SIDER dataset, but dramatically worse on small graphs, e.g., an RMSE of 0.4665 on ESOL and a mean ROC-AUC of 0.7573 on ClinTox. In addition, with the increase of graph scale, AGCN cannot learn more effective information from graphs, as the results on SIDER show. This indicates that residual Laplacian learning for hidden structures is more useful when data are scarce. As illustrated above, GAC involves both global topology and local structure information, and, more importantly, PMF is adept at aggregating the hierarchical information of graphs. In other words, the proposed PMF with GAC can help GCNNs learn features effectively from both small and large graphs.

5 Conclusion

In this paper, we introduce a graph convolutional network architecture, GAAN, in which PMF is devised to implement the aggregation strategy by iteratively abstracting the interior graph structures at different layers. The experimental results show that our strategy, emphasizing the importance of locality sensitivity and capturing interior graph structures naturally, brings improvements in molecular property prediction. This approach can be extended to other applications in many fields, for example, taking star structures into consideration to meet the requirements of social networks. Moreover, many aspects of GCNNs are worth exploring. The inference operations in a GCNN can be divided, and parts of them should be conducted in the early stages of graph convolution based on local graph structures. More mature techniques from the graph field could be introduced into GCNNs, especially frequent subgraph mining and subgraph isomorphism. New GCNN architectures that deeply combine convolutional kernels and aggregation methods according to dynamic hypergraph structures are also promising.

References