Learning Graph While Training: An Evolving Graph Convolutional Neural Network

08/10/2017 ∙ by Ruoyu Li, et al. ∙ The University of Texas at Arlington

Convolutional Neural Networks on graphs are an important generalization and extension of classical CNNs. While previous works generally assumed that the graph structures of samples are regular with unified dimensions, in many applications they are highly diverse or even not well defined. Under some circumstances, e.g. for chemical molecular data, clustering or coarsening to simplify the graphs is hard to justify chemically. In this paper, we propose a more general and flexible graph convolutional network (EGCN) fed by batches of arbitrarily shaped data together with their evolving graph Laplacians, trained in a supervised fashion. Extensive experiments demonstrate its superior performance in terms of both faster parameter fitting and significantly improved prediction accuracy on multiple graph-structured datasets.


1 Introduction

Convolutional Neural Networks (CNNs) have proven supremely successful in solving a wide variety of machine learning problems hinton2012deep. The stationarity of the data and the metric of the grid unlock the possibility of designing a local convolutional kernel that linearly combines local features. With the power of a deep architecture, the network can output high-level representations of both local features and universal structures of the signal. Even though CNNs have been successful in tasks where the data have an underlying grid structure, e.g. text, images and videos, in many problems the data lie on an irregular grid or, more generally, in non-Euclidean domains, e.g. molecular data, social networks and knowledge instances. Such data are better structured as graphs, which can handle varying node-neighborhood connectivity and non-Euclidean metrics. Under such circumstances, the stationarity, locality and compositionality that allow kernel-based convolution and pooling in CNNs are no longer satisfied, and classical CNNs cannot directly work on graph-structured data.

However, generalizing classical CNNs from regular grids to irregular graphs is not straightforward. For simplicity of kernel construction, many previous works assume the data still lie on low-dimensional graphs and that the training data share a unified graph Laplacian across the signal domain bruna2013spectral; henaff2015deep. As a result, the graphs have to be of identical dimensions, which makes it impossible to construct an end-to-end deep learning pipeline that accepts arbitrary graph inputs. Moreover, current graph convolution layers do not deeply exploit the information given by vertex connectivity, due to the difficulty of designing a kernel that is flexible with respect to varying neighborhoods atwood2016diffusion; chen2016compressing. Whereas some data on non-Euclidean domains, such as molecular data, come with an underlying graph structure or with prior knowledge of how to construct one (e.g., social networks), many others do not. It is therefore necessary to estimate the similarity matrix before performing graph convolutions. The state-of-the-art graph construction methods are classified into unsupervised and supervised ones henaff2015deep. However, in both cases the graph is constructed before the data are fed into the network. Therefore, the generated graph structure remains unchanged and is not updated during the training procedure bruna2013spectral; henaff2015deep.

Although supervised graph construction with fully connected networks has been exploited in a DNN henaff2015deep, its dense training weights restrict it to small graphs. Moreover, the graph structure learned from a fully connected architecture is not guaranteed to best serve the convolutional neural network. To tackle these challenges, we introduce a new graph convolution layer embedded with metric learning, so that each convolution layer is able to dynamically construct and learn a graph structure for each individual data sample in the batch based on the given supervised information. Directly learning the similarity matrix has O(N^2) complexity for a graph of N nodes. By harnessing supervised metric learning with the Mahalanobis distance, we can reduce the number of parameters to at most O(d^2) or even O(d), where d is the feature dimension. As a consequence, the learning complexity becomes independent of the graph size N. In classical CNNs, back-propagation generally updates kernel weights to adjust the relationship between neighboring nodes at each feature dimension individually, and then sums up signals from all filters to construct hidden-layer activations. To grant graph CNNs a similar capability, we propose a re-parameterization on the feature dimension of graph data with additional weights and a bias.
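To make the parameter-count argument concrete, here is a minimal sketch (the sizes N and d are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

N, d = 500, 75   # illustrative graph size and node-feature dimension

# Learning the similarity matrix directly: one parameter per node pair.
params_direct = N * N                    # O(N^2), grows with the graph

# Mahalanobis metric learning: only a d x d projection (or a d-dim diagonal)
# is trained, independent of the number of nodes N.
W_d = np.zeros((d, d))
params_metric_full = W_d.size            # O(d^2)
params_metric_diag = d                   # O(d)

print(params_direct, params_metric_full, params_metric_diag)  # 250000 5625 75
```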

Even for data with an inherent graph structure, it is still interesting to ask whether the given graphs optimally serve the specific learning task under the supervised information. For example, the chemical bonds connecting pairs of atoms directly induce an underlying graph for each chemical compound, yet those chemical connections are not always the optimal information source for predicting the desired outputs of a specific task. Consequently, there is an emerging need for new approaches that automatically discover the hidden, task-related graph structures that boost the performance of graph CNNs on a specific task. Motivated by deep residual learning he2016deep, we propose a residual graph Laplacian learning method that learns an optimal graph structure for each data sample and the prediction neural network simultaneously.

In this paper, we explore our approach primarily on chemical molecular datasets, although the network can be straightforwardly trained on other graph-structured data, such as point clouds and social networks. Our contributions can be summarized as follows:

  • A novel spectral graph convolution layer boosted by Laplacian learning (SGC-LL) has been proposed to dynamically update the residual graph Laplacians via metric learning for deep graph learning.

  • Re-parameterization on the feature domain has been introduced into the K-hop spectral graph convolution to enable our proposed deep graph learning and to grant graph CNNs a capability of feature extraction on graph data similar to that of classical CNNs on grid data.

  • An evolving graph convolution network (EGCN) has been designed to be fed by a batch of arbitrarily shaped graph-structured data. The network is able to construct and learn for each data sample the graph structure that best serves the prediction part of network. Extensive experimental results indicate the benefits from the evolving graph structure of data.

The rest of the paper is organized as follows. Section 2 reviews previous related works. Section 3 introduces the proposed spectral graph convolution boosted by residual Laplacian learning. Section 4 demonstrates both visual and numerical results. Section 5 concludes this paper.

2 Related Work

There have been many works exploring local receptive fields on grids krizhevsky2012imagenet; coates2011selecting with deep learning. However, there are not many works on generalizing deep convolutional networks to graph-structured data. The first attempt to formulate a CNN analogue on irregular domains modeled as graphs was made by bruna2013spectral, who investigated performing convolution on both the spatial and spectral domains of graph representations. Their work gave a spatially localized filter by designing a smooth spectral kernel constructed by B-spline interpolation, but it only worked on low-dimensional graphs.

Figure 1: Network architecture of a spatial graph CNN on graphs (molecular data) compared with a classical CNN on grids (images). A simplified spatial graph convolution operator sums up the transformed features of the neighbor nodes j in N(i), using a separate weight W_j and bias b_j for each neighbor j, while graph max pooling keeps, along each feature dimension, the maximum among the neighbors N(i) and the node itself. Red boxes mark the convolution kernel. However, on graphs the convolution kernel cannot work in the same way as on grids: different nodes' neighborhoods differ in both degree and connection type (bond). Better viewed in color print.

henaff2015deep further extended the spectral construction to a larger scale of high-dimensional graphs and proposed two graph construction methods, one unsupervised and one supervised. Inspired by previous works and based on graph signal processing (GSP) shuman2013emerging, defferrard2016convolutional introduced a new spectral graph-theoretic formulation and used Chebyshev polynomials with an approximate evaluation scheme to reduce the computational cost and achieve localized filtering. kipf2016semi presented a first-order approximation of the Chebyshev polynomials as the graph filter spectrum, which requires fewer training parameters.

Besides the above papers on constructing convolution layers on graphs, many others studied the problem from a different angle. niepert2016learning first investigated learning a network from a set of heterogeneous graphs to predict node-level features as well as to do graph completion, although it is based on node sequence selection. atwood2016diffusion introduced a graph diffusion process that delivers an effect equivalent to convolution, and its DCNN has no dependency on the indexing of nodes. Its constraints are the highly restricted locality of the diffusion process and the expensive dense matrix multiplication.

Recently, simonovsky2017dynamic investigated a problem similar to ours by learning an edge-conditioned feature weight matrix from edge features using a separate filter-generating network de2016dynamic, although their application is point cloud classification. There are other studies on learning from graph data, such as dai2016discriminative, which proposed a kernel embedding method on the feature space for graph-structured data. Another related work is grover2016node2vec, but their models do not fall into the family of feed-forward CNN analogues on graphs.

For chemical compounds, naturally modeled as graphs, duvenaud2015convolutional; wallach2015atomnet; wu2017moleculenet made several successful attempts at applying neural networks to learn representations for predictive tasks that were usually tackled by handcrafted features mayr2016deeptox or hashing weiss2009spectral. However, due to the constraints of spatial convolution, their models fail to make full use of the atom connectivities, which carry more information than the bond features provided by RDKit landrum2013rdkit. More recent explorations cover progressive networks, multi-task learning, and low-shot or one-shot learning altae2016low; gomes2017atomic. Lastly, DeepChem (https://github.com/deepchem/deepchem) is an outstanding open-source cheminformatics/machine learning benchmark; our code and demos were built and tested upon it.

3 Method

3.1 Spatial vs. Spectral Convolution

For constructing convolution operators on graph-structured data, there exist two major approaches: spatial construction and spectral construction. As their names imply, they manipulate the spatial and the spectral domain of graph signals, respectively. In particular, spatial convolution purely uses neighborhood information in terms of the graph adjacency matrix A or similarity matrix S. More formally, if at the k-th layer the input data is X^k, with N nodes and d_k features per node, the output X^{k+1} is formulated as bruna2013invariant; wu2017moleculenet:

$X^{k+1}_{:,j} = \sigma\big(\textstyle\sum_{i=1}^{d_k} F^{k}_{i,j}\, X^{k}_{:,i}\big), \quad j = 1, \dots, d_{k+1}$   (1)

where each F^k_{i,j} is an N x N matrix that linearly maps the i-th input feature dimension to the j-th output feature (and possibly d_{k+1} differs from d_k). Nonzero entries of F^k_{i,j} occur where two nodes are connected. Apparently, this model makes it hard to induce weights shared across the spatial domain: the convolution of this type reduces to an analogue of a fully connected layer with a sparse regularization, given by the adjacency A, on the weight matrices. See Fig. 1 for explicit illustrations of the spatial graph convolution and graph max pooling layers.
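For intuition, here is a minimal NumPy sketch of a simplified spatial graph convolution in the spirit of Fig. 1 (one weight matrix shared by all neighbors rather than the per-feature-pair matrices of Eq. (1); the ReLU nonlinearity and shapes are assumptions):

```python
import numpy as np

def spatial_graph_conv(X, A, W, b):
    """X: (N, d_in) node features, A: (N, N) adjacency/similarity,
    W: (d_in, d_out) feature mapping, b: (d_out,) bias."""
    A_hat = A + np.eye(A.shape[0])        # include each node itself
    H = A_hat @ X                          # sum the features of connected nodes
    return np.maximum(H @ W + b, 0.0)      # linear feature mapping + ReLU

N, d_in, d_out = 6, 4, 8
A = np.zeros((N, N)); A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0
Y = spatial_graph_conv(np.random.randn(N, d_in), A,
                       np.random.randn(d_in, d_out), np.zeros(d_out))
print(Y.shape)  # (6, 8)
```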

Compared to the spatial construction, spectral graph theory empowers us to build the convolution kernel in the spectral domain, which is more compact, and the spatial locality of the kernel is supported by the smoothness of the spectral multipliers. The baseline approach is built upon [Eq.(3), defferrard2016convolutional], which extended the one-hop spatial kernels bruna2013spectral to kernels that allow K-hop connectivities. According to the graph Fourier transform defferrard2016convolutional, if U is the graph Fourier basis of the Laplacian L (i.e., L = U Λ U^T):

$x *_{G} g_\theta = U\, g_\theta(\Lambda)\, U^{T} x, \qquad g_\theta(\Lambda) = \textstyle\sum_{k=0}^{K} \theta_k \Lambda^{k}$   (2)

where Λ is the diagonal matrix of frequencies (eigenvalues) of the Laplacian L. Eq.(2) gives us an elastic kernel that allows any pair of nodes with shortest path distance at most K to squeeze in. Of course, far-away connectivity means less similarity and will be assigned less importance by the multipliers θ_k.

Recursive fast filtering. Evaluating Eq.(2) is expensive due to the dense matrix multiplications with U. Instead, g_θ(Λ) is approximated by Chebyshev coefficients θ_k and polynomial functions T_k, i.e. g_θ(Λ) ≈ Σ_{k=0}^{K} θ_k T_k(Λ̃) on the rescaled Λ̃. The computation of T_k(L̃)x is replaced by the recursion T_k(L̃)x = 2 L̃ T_{k-1}(L̃)x − T_{k-2}(L̃)x, with T_0(L̃)x = x and T_1(L̃)x = L̃x. Then the K-hop kernel becomes g_θ(L) = Σ_{k=0}^{K} θ_k T_k(L̃), still parameterized by a vector θ of size K+1. Consequently, the entire cost is reduced to O(K|E|) from O(N^2), because of the natural sparsity of L defferrard2016convolutional.
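A minimal sketch of the Chebyshev-recursion filtering described above, following the standard scheme of defferrard2016convolutional (dense matrices and the eigenvalue rescaling constant lmax are simplifying assumptions; the real speed-up comes from sparse multiplications with L):

```python
import numpy as np

def cheby_filter(X, L, theta, lmax=2.0):
    """Compute sum_k theta_k T_k(L_tilde) X with the Chebyshev recursion
    T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x); L_tilde rescales the spectrum to [-1, 1]."""
    N = L.shape[0]
    L_tilde = (2.0 / lmax) * L - np.eye(N)
    Tx_prev, out = X, theta[0] * X               # T_0(L~) X = X
    if len(theta) > 1:
        Tx_curr = L_tilde @ X                    # T_1(L~) X = L~ X
        out = out + theta[1] * Tx_curr
        for k in range(2, len(theta)):
            Tx_next = 2.0 * (L_tilde @ Tx_curr) - Tx_prev
            out = out + theta[k] * Tx_next
            Tx_prev, Tx_curr = Tx_curr, Tx_next
    return out
```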

Re-parameterization on the feature domain. One major idea for graph CNNs is to exactly reconstruct the classical CNN on graphs. This is difficult, because a regularly shaped kernel is impossible on graphs. bruna2013invariant; duvenaud2015convolutional simply bypass building kernels on the spatial domain, and instead apply feature transformations conditioned on edge distance bruna2013invariant or even node degree duvenaud2015convolutional. The spectral kernel of Eq.(2) is a promising attempt, but it distributes weights in the spatial domain similarly to a concentric zone model, which is still not as flexible as a convolution kernel on a grid. Besides, in a convolution layer of classical CNNs, the output activations combine filtered signals from all feature maps, in which separate kernels work independently. In other words, they not only sum up features from spatial neighbors but also mine relationships among the feature dimensions. To mimic classical CNNs, we re-parameterize the output of Eq.(2) with a feature-domain transformation matrix W and a bias b. Intuitively, we divide the operations of classical CNNs on the spatial and feature domains into two consecutive stages: 1) filter the graph signal with the spectral kernel g_θ(L); 2) linearly map the filtered input features to the output features. The layer after re-parameterization is as below:

$Y = \big(U\, g_\theta(\Lambda)\, U^{T} X\big)\, W + b$   (3)
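A minimal dense sketch of the re-parameterized layer of Eq. (3), using an explicit eigendecomposition in place of the fast recursion (shapes and the polynomial parameterization of g_theta are assumptions consistent with Eq. (2)):

```python
import numpy as np

def sgc_forward(X, L, theta, W, b):
    """Y = (U g_theta(Lambda) U^T X) W + b, with g_theta(Lambda) = sum_k theta_k Lambda^k.
    X: (N, d_in), L: (N, N) graph Laplacian, W: (d_in, d_out), b: (d_out,)."""
    lam, U = np.linalg.eigh(L)                          # graph Fourier basis and frequencies
    g = sum(t * lam ** k for k, t in enumerate(theta))  # spectral multipliers g_theta(Lambda)
    filtered = U @ (g[:, None] * (U.T @ X))             # filter the signal in the spectral domain
    return filtered @ W + b                             # feature-domain re-parameterization
```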

3.2 Graph CNN with Laplacian Learning

The state-of-the-art graph convolutional neural network methods all utilize the graph Laplacian matrix in some way, most often the normalized graph Laplacian. Given the adjacency matrix A and the degree matrix D of a graph G, the normalized graph Laplacian matrix L is:

$L = I - D^{-1/2} A D^{-1/2}$   (4)
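A minimal sketch of Eq. (4); the guard against isolated (zero-degree) nodes is an implementation detail assumed here:

```python
import numpy as np

def normalized_laplacian(A):
    """Eq.(4): L = I - D^{-1/2} A D^{-1/2} for an adjacency (or similarity) matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    d_inv_sqrt[deg == 0] = 0.0                          # isolated nodes contribute nothing
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```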

As we know, L defines both the node-wise connectivity and the degrees of vertices. Some types of data have an inherent graph structure, e.g. chemical molecular data: each molecule is a graph with atoms as vertices and bonds as edges. Those chemical bonds can be verified by experiments and are even visible in some cases. However, most data do not come with a graph structure, so we have to construct graphs before feeding them to our deep networks. Besides these two cases, the most likely case is that the inherent graph cannot sufficiently express all of the meaningful node-wise connectivities. For example, mayr2016deeptox proposed to predict the toxicity of drugs by learning representations of toxic sub-structures from labeled molecular samples. The graph directly given by a SMILES weininger1988smiles sequence does not tell anything about toxicity; the model has to learn the atom connectivity that forms the sub-structures most related to toxicity. The discovered toxic sub-structure may happen to consist of existing bonds, e.g. a benzene ring, or not at all. Given this, the next question becomes what defines a particularly good distance metric that best describes those hidden connectivities driven by the learning task.

Figure 2: The composition of the Evolving Graph Convolution Network (EGCN) and the residual Laplacian learning scheme of the SGC-LL layer. The evolving graph Laplacian is L̂ = L + αL_res (Eq.(6)). The projection weights of the distance metric in each SGC-LL layer get updated in back-propagation. In the feed-forward pass, the updated metric and the resulting L̂ are used to compute the spectral convolution (Eq.(3)). Better viewed in color print.

Supervised Metric Learning. In the metric learning literature, algorithms are divided into supervised and unsupervised metric learning wang2015survey. Unsupervised metric selection picks the metric that works best for clustering the data samples: the optimal metric should minimize the intra-cluster distances and maximize the inter-cluster distances. For datasets that come with labels, the quality of the metric is determined by the learning loss; parameterized as part of the learning model, the metric converges to the optimum when the learning curve stabilizes. The generalized Mahalanobis distance measures the distance between samples x_i and x_j by:

$\mathbb{D}(x_i, x_j) = \sqrt{(x_i - x_j)^{T} M (x_i - x_j)}$   (5)

If M = I, Eq.(5) reduces to the Euclidean distance. In the proposed EGCN, the symmetric positive semi-definite matrix M = W_d W_d^T is the trainable weight of the SGC-LL layer. The projection W_d works as a transform basis to a domain in which we measure the Euclidean distance between x_i and x_j. We then use that distance to calculate the Gaussian kernel G(x_i, x_j) = exp(−D(x_i, x_j)/(2σ^2)). The optimal transformation matrix is the one that generates the graphs that best fit our learning tasks. Although the distance formulation of Eq.(5) seems trivial, its gradient w.r.t. W_d is cheap to compute in back-propagation, which is the main source of computation in a DNN.
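A minimal sketch of the learned metric and the resulting Gaussian similarity of Eq. (5); the kernel bandwidth sigma and the exact normalization inside the exponential are assumptions:

```python
import numpy as np

def learned_similarity(X, W_d, sigma=1.0):
    """Pairwise Gaussian similarity under the Mahalanobis metric M = W_d W_d^T.
    X: (N, d) node features, W_d: (d, d) trainable projection."""
    Z = X @ W_d                                  # project into the learned basis
    diff = Z[:, None, :] - Z[None, :, :]         # pairwise differences in that basis
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # Mahalanobis distances D(x_i, x_j)
    return np.exp(-dist / (2.0 * sigma ** 2))    # Gaussian kernel G(x_i, x_j)
```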

Learning Residual Graph Laplacian. As discussed above, to discover the hidden correlations between nodes in the graph, we introduce the parameterized distance of Eq.(5) to update the Gaussian similarity matrix S (and the adjacency matrix after thresholding), and then use the updated S to compute the normalized graph Laplacian (Eq.(4)). Because the distance metric parameters are randomly initialized, it may take a long time for the model to converge. To accelerate convergence and increase the stability of our model, we make a reasonable assumption that the optimal graph Laplacian L̂ is a small shift from the original graph Laplacian L; in other words, the original graph Laplacian already discloses a large amount of helpful graph structural information. Consequently, instead of directly learning L̂, we learn the residual graph Laplacian L_res, so we have:

$\hat{L} = L + \alpha L_{res}$   (6)
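Putting the pieces together, here is a minimal sketch of the evolving Laplacian of Eq. (6), reusing the learned_similarity and normalized_laplacian helpers from the sketches above (the thresholding step and the values of alpha and the threshold are assumptions):

```python
import numpy as np

def evolving_laplacian(L_intrinsic, X, W_d, alpha=0.2, threshold=0.1):
    """L_hat = L + alpha * L_res, where L_res is built from the learned similarity."""
    S = learned_similarity(X, W_d)                  # Gaussian kernel under the learned metric
    A_learned = np.where(S > threshold, S, 0.0)     # sparsify into an adjacency matrix
    np.fill_diagonal(A_learned, 0.0)
    L_res = normalized_laplacian(A_learned)         # normalize as in Eq.(4)
    return L_intrinsic + alpha * L_res
```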

The proposed Laplacian-learning boosted spectral graph convolution layer is fed by a mini-batch of arbitrarily shaped graphs; it performs convolution in the spectral domain with the K-hop elastic kernel, whose training parameters consist of the spectral coefficients and the weights W_d of the distance metric. In Fig. 2, the network consists of two SGC-LL layers, in which the two sets of graph Laplacians are updated independently and will probably diverge, because the inputs differ and the two layers work on different feature maps.

In Section 4, our experimental results on multiple datasets indicate that for data with inherent graphs, e.g. drug data given as SMILES sequences, the original Laplacian is quite close to the optimal one. However, the small updates to the graph connectivity made within 20 epochs significantly raise the performance of the model. L_res plays a role similar to a regularization on L, and its weight is balanced by α. For datasets without given graphs, we can use clustering algorithms, e.g. k-nearest neighbors or spectral clustering ng2001spectral, to construct graphs in an unsupervised fashion; using them as the initialization of the network is better than a purely random weight initializer. See Fig. 2 for details of the SGC-LL layer and the residual graph Laplacian learning procedure within it.

4 Experiments

Network Configuration of EGCN. The proposed network is named the evolving graph convolution network (EGCN) because it allows the graph structure to evolve according to the context of the learning task. Besides SGC-LL, it has a graph max pooling layer and a gather layer gomes2017atomic. The max pooling on the graph is performed feature by feature: for each node v, the operator replaces the j-th feature of node v with the maximum over the original values of its neighbors and itself, x̂_v(j) = max{ x_u(j) : u ∈ N(v) ∪ {v} }. The graph gather layer simply sums up the feature vectors of all nodes and outputs the sum as the representation of the graph, so that we can perform graph-level regression or classification. The motivation for embedding a bilateral filter in EGCN is to counter over-fitting gadde2016superpixel: the evolving graph Laplacian definitely adapts the model to better fit the training data, but at the risk of over-fitting. To prevent this, we introduce a revised bilateral filtering layer that regularizes the activation of the SGC-LL layer by augmenting the spatial locality of L̂. We also introduce batch normalization layers to accelerate training ioffe2015batch.
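A minimal sketch of the graph max pooling and graph gather operations described above (the adjacency-mask lookup of neighborhoods is an implementation assumption):

```python
import numpy as np

def graph_max_pool(X, A):
    """Replace each node's j-th feature with the max over itself and its neighbors."""
    N = X.shape[0]
    A_hat = (A + np.eye(N)) > 0                       # neighborhood including the node itself
    return np.stack([X[A_hat[v]].max(axis=0) for v in range(N)])

def graph_gather(X):
    """Sum all node feature vectors into one graph-level representation."""
    return X.sum(axis=0)
```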

Batch Training of Non-uniformly Shaped Samples. One of the greatest challenges for graph CNNs is that the training graphs have different shapes: 1) it raises the difficulty of designing kernels, because the invariance of the kernel on graphs is not satisfied and node indexing sometimes matters; 2) resizing (clustering) bruna2013invariant is not reasonable for some types of graphs, such as molecular data, since graph coarsening or pooling would drop significant atoms along with their features; 3) most deep learning APIs do not support batched training inputs of varying shapes (there are recent efforts from Google's TensorFlow looks2017deep, but due to time constraints we did not port our code to that framework). In this work, we bypassed the tensor shape constraint of TensorFlow in our implementation: samples have different numbers of nodes, so their graph Laplacians necessarily differ, but they share all the model parameters. In the experiments, we reused almost the same parameter setup for all datasets. The batch size is 256. The optimizer is Adam with an exponentially decayed learning rate (decayed by 0.9 every 50 iterations) starting at 0.005. The maximum number of epochs is 50. We extracted 75 node features and 6 edge features.
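For reference, the exponential learning-rate decay described above can be written as a simple schedule (a sketch; the staircase form is an assumption):

```python
def learning_rate(step, base_lr=0.005, decay=0.9, decay_every=50):
    """Learning rate decayed by a factor of 0.9 every 50 iterations, starting at 0.005."""
    return base_lr * decay ** (step // decay_every)
```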

Figure 3: Example of residual graph Laplacian learning. Four snapshots of the evolving graph similarity matrix (shown as heat maps), recorded at the 5th, 10th, 15th and 20th epochs. The displayed matrix is from the compound Benfuracarb, which has 28 atoms.


Figure 4: The weighted loss curve (left) during training and the mean RMSE curve (right) over epochs 0-50 on the validation set of the Delaney solubility dataset. Red lines were generated by our "SGC-LL". Compared benchmarks: graphconv bruna2013spectral, gcn defferrard2016convolutional, NFP duvenaud2015convolutional. Better viewed in color print.

4.1 Performance boosted by SGC-LL Layer

This experiment demonstrates the close correlation between the evolving graph Laplacian and model fitting. Fig. 3 shows four heat maps of the graph similarity matrix S, used to compute the evolving graph Laplacian at the second SGC-LL layer. As shown in Fig. 4, the weighted loss dropped quickly during epochs 5-20, and so did the mean RMSE score. In the meanwhile, the graph Laplacians kept evolving according to the gradients back-propagated from the next layer. The white circles mark one of the major regions of intensity in S that changed significantly during epochs 5-20: the connections between some pairs of nodes were reinforced (became lighter), while others were weakened (became darker). Besides, the two plots in Fig. 4 show that the EGCN network equipped with the proposed SGC-LL layer (red line) has overwhelmingly better performance in both convergence speed and predictive accuracy. We attribute this improvement to the supervised residual graph Laplacian learning scheme during training: the evolving graph Laplacians used in the spectral graph convolution fit the data better than a fixed graph Laplacian defferrard2016convolutional; kipf2016semi.

| Method | Delaney solubility | Az-logD | NCI |
| G-CNN bruna2013spectral | 0.42225 8.38 | 0.75160 8.42 | 0.86958 3.55 |
| NFP duvenaud2015convolutional | 0.49546 2.30 | 0.95971 5.70 | 0.87482 7.50 |
| GCN defferrard2016convolutional | 0.46647 7.07 | 1.04595 3.92 | 0.87175 4.14 |
| SGC-LL | 0.30607 5.34 | 0.73624 3.54 | 0.86474 4.67 |
Table 1: Mean RMSE scores on the Delaney, Az-logD and NCI datasets
| Method | Tox21 Valid | Tox21 Test | ClinTox Valid | ClinTox Test | Sider Valid | Sider Test | Toxcast Valid | Toxcast Test |
| G-CNN bruna2013spectral | 0.7105 | 0.7023 | 0.7896 | 0.7069 | 0.5806 | 0.5642 | 0.6497 | 0.6496 |
| NFP duvenaud2015convolutional | 0.7502 | 0.7341 | 0.7356 | 0.7469 | 0.6049 | 0.5525 | 0.6561 | 0.6384 |
| GCN defferrard2016convolutional | 0.7540 | 0.7481 | 0.8303 | 0.7573 | 0.6085 | 0.5914 | 0.6914 | 0.6739 |
| SGC-LL | 0.7947 | 0.8016 | 0.9267 | 0.8678 | 0.6112 | 0.5921 | 0.7227 | 0.7033 |
Table 2: Task-averaged ROC-AUC scores on the Tox21, ClinTox, Sider and Toxcast datasets

4.2 Prediction on Chemical Molecular Datasets

The Delaney dataset (http://pubs.acs.org/doi/abs/10.1021/ci034243x) contains aqueous solubility data for 1,144 low-molecular-weight compounds. The most complex compound in the dataset has 492 atoms, while the smallest one consists of only 3 atoms. For organic compounds, we set the maximum node degree to 10. The NCI chemical compound database (https://cactus.nci.nih.gov/download/nci/) has around 20,000 training compound samples and 60 prediction tasks ranging from drug reaction experiments to clinical pharmacology studies. Lastly, the Az-logD dataset from ADME vugmeyster2012absorption is a set of compounds with logD measurements correlated to permeability. The presented task-averaged RMSE scores and standard deviations were obtained after 5-fold cross-validation.

To demonstrate our advantage, we compared our network with three state-of-the-art graph CNN benchmarks: the pioneering graph CNN (G-CNN) bruna2013spectral, its spectral-domain extension to K hops (GCN) defferrard2016convolutional, and neural fingerprints (NFP) duvenaud2015convolutional. As shown in Table 1, our network reduced the mean RMSE by 31%-40% on the Delaney dataset, by 15% on average on Az-logD, and by 2-4% on the testing set of NCI. The improvements come from the more meaningful representations extracted by the SGC-LL layer. First, the K-hop kernel of Eq.(2) used to be impossible for the spatial-domain constructions bruna2013spectral; duvenaud2015convolutional; second, the re-parameterization offers the feature-domain filter mappings that were absent in defferrard2016convolutional. Besides, our residual Laplacian learning and updating scheme learns, during training, a better graph structure that optimally fits the learning task, which makes more sense than graphs constructed by unsupervised clustering gadde2016superpixel or by separately trained networks henaff2015deep; simonovsky2017dynamic.

4.3 Multi-task Classification on Pharmacological Datasets

The Tox21 dataset mayr2016deeptox we used contains 7,950 chemical compounds. It has 12 classification tasks for different toxicity assays; however, not every sample has all 12 labels. For samples with missing labels, we excluded the missing tasks when computing the losses but still kept the samples in the training set. ClinTox is a public dataset of 1,451 chemical compounds for clinical toxicological study, with labels for 2 tasks. The Sider database (http://sideeffects.embl.de/) records 1,392 drugs and their 27 different side effects or adverse reactions. Toxcast is another toxicological research database with 8,571 SMILES samples and labels for 617 predictive tasks. For K-task prediction, the network becomes an analogue of a K-ary tree with K leaf nodes, each of which is a fully connected layer followed by logistic regression that generates the score for one task mayr2016deeptox. The displayed scores in Table 2 are averaged over all tasks. Our method clearly raises the classification accuracy on both small and large datasets, by as much as 5% on average over the 617 tasks of the Toxcast dataset.
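A minimal sketch of the K-task prediction head described above: each leaf is a small fully connected layer followed by logistic regression applied to the gathered graph representation (layer sizes and the ReLU are assumptions):

```python
import numpy as np

def multitask_scores(g, heads):
    """g: (d,) graph-level representation; heads: one (W1, b1, w2, b2) tuple per task."""
    scores = []
    for W1, b1, w2, b2 in heads:
        h = np.maximum(g @ W1 + b1, 0.0)                      # task-specific hidden layer
        scores.append(1.0 / (1.0 + np.exp(-(h @ w2 + b2))))   # logistic-regression score
    return np.array(scores)
```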

5 Conclusions

We proposed a new spectral graph convolution layer that learns residual graph Laplacians by learning optimal metric weights. The proposed EGCN can be fed with a batch of arbitrarily shaped graph samples. For each sample, the network individually learns the graph structure that optimally expresses the hidden node-wise connectivity, and the supervised training is driven by the context of the learning task. Extensive experiments show that our evolving graph CNN outperforms the state of the art on multiple datasets. In the future, we plan to design a truly spatial elastic kernel on graphs. Second, the implementation of SGC-LL needs to be remodeled and hopefully accelerated. Another interesting direction is to extend graph CNNs to applications such as natural language understanding and user-behavior prediction on social networks.

References