1 Introduction
Convolutional neural networks (CNNs) have achieved great success in tasks ranging from computer vision [Huang:2016wa] and speech recognition [Zhang:2017up] to natural language processing [Conneau:2017to]. CNNs provide an efficient and effective architecture for learning meaningful representations of images and text. In recent years, researchers have strived to extend the convolution operator and develop CNN architectures for graphs, which can have more complicated structures than images. Graph convolutional networks are usually applied to the following two kinds of learning tasks:
Node-centric: prediction tasks related to the nodes in a graph. Graph convolutional networks usually handle these by outputting a feature vector for each node in the graph, which meaningfully reflects the node’s properties and neighborhood structure. For example, in social networks, the vectors can be used for tasks like node classification and link prediction. This setting is closely related to node representation learning.

Graph-centric: prediction tasks related to whole graphs. For example, in the context of chemistry, a molecule can be viewed as a graph with atoms as nodes and bonds as edges. Graph convolutional networks are constructed to encode the molecules meaningfully in terms of their physical and chemical properties. These tasks are key to many real-world applications such as material design and drug screening. In this context, graph convolutional networks usually encode the whole graph and use the encodings for graph prediction tasks.
Early efforts on designing neural networks for graphs date back to the works of Gori et al. and Scarselli et al. [gori2005new, scarselli2009graph], in which they built sequential or recurrent network architectures for graph-structured data. The studies of Bruna et al. [bruna2013spectral], Edwards et al. [Edwards:2016vy] and Defferrard et al. [defferrard2016convolutional] further developed the idea of spectral filtering/convolution, which operates on the graph spectrum. Henaff et al. [henaff2015deep] extended graph convolutional networks to large-scale problems such as ImageNet object recognition, text categorization, and bioinformatics. Meanwhile, Niepert et al. [niepert2016learning] proposed the PATCHY-SAN approach, which defined operations of node sequence selection, neighborhood assembly, and graph normalization. Atwood et al. [Atwood:2016wq] presented the diffusion-convolutional neural network (DCNN) model for graph-structured data. As we will show later, these models successfully made CNNs work under graph settings, but they still lack careful consideration of the specialties of graph structures in the network design.
There are also several recently published results on conducting graph convolutions dynamically. Jia et al. [Jia:2016wt] proposed the Dynamic Filter Network, where filters are generated dynamically conditioned on the input features. Simonovsky et al. [Simonovsky:2017tv] extended that idea to graphs, using edge-conditioned dynamic weights for graph convolutions. The work of Verma et al. [Verma:2017tb] determined the shape of filters as a function of the features in previous network layers. Manessi et al. [Manessi:2017wp] proposed a model to learn temporal information from graphs whose structure changes over time. Li et al. [Li:2017ud] proposed a general and flexible graph convolutional network (EGCN) to deal with data with diverse or undefined dimensions. All of these works found one way or another to utilize the graph data dynamically, and our work can be seen as a further endeavor in this direction with the introduction of the adaptive convolution module.
Besides, a great amount of research on graph-centric tasks concentrates on the application of molecule fingerprints. Molecule fingerprinting refers to a quantitative encoding of molecules that can be used for molecule property summarization and prediction. Prior to the use of CNNs, graph kernels dominated many learning and prediction tasks for (molecule) graphs [kondor2002diffusion, shervashidze2009efficient, shervashidze2011weisfeiler]. The paper of Duvenaud et al. [duvenaud2015convolutional] first introduced CNNs for encoding molecules, and Kearnes et al. [kearnes2016molecular] further improved the results. Gilmer et al. [Gilmer:2017tl] defined Message Passing Neural Networks (MPNNs) for molecules to reformulate existing models into a single common framework with a message passing interpretation. It will be shown later that our work improves on the state-of-the-art performance on molecule-relevant tasks with a network architecture that better captures the properties of molecule graphs.
In this work, we propose a novel graph convolutional network architecture named High-order and Adaptive Graph Convolutional Network (HAGCN). The work most closely related to ours is the graph convolutional network (GCN) [kipf2016semi], in which the convolution operator only reaches one-hop neighbors. Our high-order operator provides an efficient design of convolution that reaches $k$-hop neighbors. Furthermore, we introduce an adaptive filtering module that adjusts the weights of the convolution operators dynamically based on the local graph connections and node features. Compared with the work of Li et al. [li2015gated], which introduced the modern idea of LSTM into graph settings, our adaptive module can be interpreted as a graph realization of the attention mechanism proposed in [xu2015show]. Most importantly, unlike previous graph networks designed for either node-centric or graph-centric tasks, our HAGCN framework is general-purpose and capable of fulfilling both. Additionally, we construct a graph generative model with HAGCN for the task of molecule generation, achieving a significant improvement over the state-of-the-art model.
Our contribution is twofold:

We introduce two new modules for graph-structured data and build a novel graph convolutional network framework, HAGCN.

We develop a general-purpose architecture that can be applied to node-centric prediction, graph-centric prediction and graph generative modeling. Our architecture achieves state-of-the-art performance uniformly on all the tasks.
The rest of the paper is organized as follows. First, we provide some preliminaries for the graph model and a brief discussion of several frameworks of graph convolutional networks. Then we introduce the key ideas of the high-order convolution operator and the adaptive filtering module. Next, we present our HAGCN framework and several experiments to demonstrate its performance. Finally, we summarize the scope of HAGCN applications and point out potential future directions.
2 Preliminaries
2.1 The Graph Model
Footnote 1: In this paper, we use the term “graph” to refer to the graph/network structure of data and “network” for the architecture of machine learning models.
In this subsection, we provide the preliminaries and notation for the graph model. A graph is denoted as a pair $G = (V, E)$, with $V$ the set of nodes (vertices) and $E$ the set of edges. Here we do not distinguish undirected and directed graphs in terms of notation, since our framework works for both cases. Each graph can be represented by an $n$ by $n$ adjacency matrix $A$, where $A_{ij} = 1$ if there is an edge from $v_i$ to $v_j$ and $A_{ij} = 0$ otherwise. Based on the adjacency matrix, we can define a distance function $d(v_i, v_j)$ to represent the graph distance from $v_i$ to $v_j$ (the minimum length of the paths connecting $v_i$ and $v_j$). Additionally, we assume that each node $v_i$ is associated with a feature vector $x_i \in \mathbb{R}^m$, and we use $X \in \mathbb{R}^{n \times m}$ to denote the feature matrix compactly.
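As a concrete illustration of the notation above, the following minimal sketch builds a 0/1 adjacency matrix from an edge list and computes the graph distance from one node to all others by breadth-first search. The function names `adjacency_matrix` and `graph_distance` are illustrative choices made here, not part of the paper, and the use of breadth-first search is an assumption about how the distance would be computed in practice.

```python
from collections import deque


def adjacency_matrix(n, edges, directed=False):
    """Build an n-by-n 0/1 adjacency matrix from a list of (i, j) edges."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        if not directed:
            A[j][i] = 1  # undirected graphs give a symmetric matrix
    return A


def graph_distance(A, source):
    """Return the graph distance from `source` to every node (None if unreachable)."""
    n = len(A)
    dist = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if A[u][v] and dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Path graph 0 - 1 - 2 - 3
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
print(graph_distance(A, 0))  # [0, 1, 2, 3]
```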
2.2 Graph Convolutional Networks (GCNs)
In this subsection, we briefly review several GCN structures from previous works to provide some intuition for the design of convolution on graphs. To begin with, the convolution operator at a specific node $v_i$ in a graph can be generally expressed as
$$L_{\mathrm{conv}}(i) = \sum_{j \in \mathcal{N}_i} w_{ij} x_j + b_i.$$
Here $x_j$ is the input feature for node $v_j$, $b_i$ is the bias term and $w_{ij}$ is the weight, which can be non-stationary and vary with respect to $j$. The set $\mathcal{N}_i$ defines the scope of the convolution. For traditional applications, the CNN architecture is usually designed for a low-dimensional grid with the same connection pattern for every node. For example, images can be viewed as two-dimensional grids (one for each of the RGB channels, or for the gray-scale channel), and the underlying graph is formed by connecting adjacent pixels. Then $\mathcal{N}_i$ can be simply defined as a fixed-size block or window around pixel $i$.
In more general graph settings, one can define $\mathcal{N}_i$ as the set of nodes that are adjacent to $v_i$. For example, the core of the fingerprint (FP) convolution operator in the work of Duvenaud et al. [duvenaud2015convolutional] is to compute the average over neighbors, i.e. $w_{ij} = 1/|\mathcal{N}_i|$ for all $j \in \mathcal{N}_i$. With the help of the adjacency matrix $A$, we can write the operator as
$$L_{\mathrm{FP}} = A X. \qquad (1)$$
The multiplication of $A$ and the feature matrix $X$ results in a feature averaging over neighbor nodes (up to normalization by the number of neighbors). One step further, the paper of nodeGCN [kipf2016semi] applied a linear weighting and a nonlinear transformation in addition to the averaging:
$$L_{\mathrm{nodeGCN}} = \sigma(A X W). \qquad (2)$$
The weight matrix $W$ and the function $\sigma$ perform a linear and a nonlinear transformation on the features, respectively.
The papers of Bruna et al. [bruna2013spectral] and Defferrard et al. [defferrard2016convolutional] took a different approach by conducting convolution on the spectrum of a graph Laplacian. Let $\mathcal{L}$ be the graph Laplacian and $\mathcal{L} = U \Lambda U^\top$ its orthogonal decomposition (where $U$ is an $n \times n$ orthogonal matrix and $\Lambda$ is a diagonal matrix). Instead of appending the weight matrix as in (2), the spectral convolution considers a parameterized convolution operator on the spectrum $\Lambda$. Precisely,
$$L_{\mathrm{spectral}} = U g_\theta(\Lambda) U^\top X. \qquad (3)$$
Here the function $g_\theta$ is a polynomial function which is applied elementwise to the diagonal matrix $\Lambda$.
When discussing the advantage of spectral convolution, the authors mentioned that a $k$-th order polynomial choice of $g_\theta$ is exactly $k$-localized on the graph, which means the convolution reaches as far as the $k$-hop neighbors. Compared to the one-hop neighbor averaging in (1) and (2), this allows faster information propagation over the graph. However, the choice of a polynomial $g_\theta$ does not give an exact convolution operator for $k$-hop neighbors, as not all neighbors are equally weighted as assumed in convolution, bearing in mind that $U g_\theta(\Lambda) U^\top = g_\theta(\mathcal{L})$. This motivates the proposal of our high-order convolution operator. Another problem with all of those convolution operators is that they use fixed convolution weights, which are invariant across graphs. Therefore they can hardly capture the differences between the locations where the convolution operation happens. This motivates the design of our adaptive module, which takes both the local features and the graph structure into account.
3 High-Order and Adaptive Graph Convolutional Network (HAGCN)
3.1 $k$-th Order Graph Convolution
We begin with the definition of the $k$-hop ($k$-th order) neighborhood: $\mathcal{N}_i^k = \{ v_j \in V \mid d(v_i, v_j) \le k \}$ for node $v_i$. In fact, the exact $k$-hop connectivity can be obtained by repeated multiplication of the adjacency matrix $A$, as formally stated in the following proposition.
Proposition 1.
Let $A$ be the adjacency matrix of a graph $G$. Then the $(i, j)$ entry of its $k$-th power $A^k$ is the number of $k$-hop paths from $v_i$ to $v_j$.
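Proposition 1 can be checked numerically on a small graph. The sketch below is a plain-Python illustration (the helper names `matmul` and `matrix_power` are chosen here for exposition) that raises an adjacency matrix to the second power and reads off the number of two-step paths, where a path may revisit nodes.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def matrix_power(A, k):
    """Compute A^k by repeated multiplication, starting from the identity."""
    n = len(A)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, A)
    return result


# Triangle on nodes 0, 1, 2 with an extra node 3 attached to node 2
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

A2 = matrix_power(A, 2)
print(A2[0])  # [2, 1, 1, 1]
```

For instance, `A2[0][3] == 1` because 0 → 2 → 3 is the only two-step path from node 0 to node 3, while `A[0][3] == 0` because the two nodes are not adjacent.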
With this proposition, we can define a $k$-th order convolution operator as follows:
$$L_{\mathrm{gconv}}^{(k)} = (W_k \circ \tilde{A}^k) X + B_k, \qquad (4)$$
where
$$\tilde{A}^k = \min\{A^k + I, 1\}. \qquad (5)$$
Here $\circ$ and $\min$ refer to the elementwise matrix product and the elementwise minimum, respectively. $W_k$ is the weight matrix and $B_k$ is the bias matrix. $\tilde{A}^k$ is obtained by clipping $A^k + I$ to $1$. Adding the identity matrix $I$ to $A^k$ creates a self loop for each node in the graph. The clipping is motivated by the fact that if the matrix $A^k + I$ has elements larger than one, clipping those values to $1$ leads exactly to the convolution over the $k$-hop neighborhood. The inputs of the operator are the adjacency matrix $A$ and the feature matrix $X$. Its output has the same dimension as $X$. As the name suggests, the convolution takes the feature vectors of a node’s $k$-hop neighbors as input and outputs a weighted average of them.
The operator in (4) implements our idea of $k$-th order convolution on a graph, which corresponds to a convolution with kernel size $k$ in conventional CNN terminology. On one hand, it can be viewed as an efficient high-order generalization of the first-order graph convolution in (2). On the other hand, this operator is closely related to the graph spectral convolution in (3), as the $k$-th order polynomial on the graph spectrum can also be regarded as an operation within the scope of the $k$-hop neighborhood $\mathcal{N}_i^k$.
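The $k$-th order operator described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the operator has the form $(W_k \circ \tilde{A}^k) X + B_k$ with $\tilde{A}^k = \min\{A^k + I, 1\}$, an $n \times n$ weight matrix and an $n \times m$ bias matrix, and all function names are hypothetical.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def hadamard(P, Q):
    """Elementwise product of two matrices of equal shape."""
    return [[p * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def clipped_power(A, k):
    """Return min(A^k + I, 1): 1 where a k-step path exists or i == j, else 0."""
    n = len(A)
    power = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        power = matmul(power, A)
    return [[min(power[i][j] + int(i == j), 1) for j in range(n)]
            for i in range(n)]


def kth_order_conv(A, X, W, B, k):
    """Assumed form of the k-th order convolution: (W o A~^k) X + B."""
    A_k = clipped_power(A, k)
    out = matmul(hadamard(W, A_k), X)
    return [[o + b for o, b in zip(ro, rb)] for ro, rb in zip(out, B)]


# Triangle on nodes 0, 1, 2 with an extra node 3 attached to node 2
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
X = [[1.0], [2.0], [3.0], [4.0]]          # one feature per node (m = 1)
W = [[1.0] * 4 for _ in range(4)]         # all-ones weights, for illustration
B = [[0.0] for _ in range(4)]             # zero bias

print(kth_order_conv(A, X, W, B, 1))  # [[6.0], [6.0], [10.0], [7.0]]
print(kth_order_conv(A, X, W, B, 2))  # [[10.0], [10.0], [6.0], [7.0]]
```

With all-ones weights the operator simply sums the features selected by $\tilde{A}^k$, which makes the effect of the clipped adjacency power easy to inspect; in a trained network $W_k$ and $B_k$ would be learned.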
3.2 Adaptive Filtering Module
Based on the operator (4), we now introduce an adaptive filtering module for graph convolution. It filters the convolution weights according to the features and the neighborhood connections of a specific node. Take molecule graphs in chemistry as an example: benzene rings are more important than alkyl chains when predicting the properties of molecules. As a result, we desire larger convolution weights for neighboring atoms on benzene rings than for those on alkyl chains. Without the adaptive module, graph convolutions are spatially invariant and fail to work as desired. The introduction of adaptive filters allows the network to find the convolution target adaptively and to better capture local disparities.
The idea of adaptive filtering comes from the attention mechanism [xu2015show], which selects the pixels of interest adaptively while generating the corresponding words in the output sequence. It can also be viewed as a variant of the gates that optionally let information through in the Long Short-Term Memory (LSTM) network [hochreiter1997long]. Technically, our adaptive filter is a nonlinear operator $g$ on the weight matrix $W_k$, i.e.
$$\tilde{W}_k = g \circ W_k, \qquad (6)$$
where $\circ$ denotes the elementwise matrix product. In fact, the operator $g$ is determined jointly by $\tilde{A}^k$ and $X$, i.e. $g = f_{\mathrm{adp}}(\tilde{A}^k, X)$, reflecting both node features and graph connections.
We consider two candidates for the function $f_{\mathrm{adp}}$:
$$f_{\mathrm{adp/prod}} = \mathrm{sigmoid}(\tilde{A}^k X Q) \qquad (7)$$
and
$$f_{\mathrm{adp/lin}} = \mathrm{sigmoid}([\tilde{A}^k, X]\, Q). \qquad (8)$$
Here and hereafter, $[\cdot, \cdot]$ refers to matrix concatenation. The first operator considers the interaction of node features and graph connections via an inner product of $\tilde{A}^k$ and $X$, while the second one does so via a linear transformation. In practice, we find that the linear adaptive filter (8) achieves better performance than the product one (7) on almost all tasks. Therefore, we adopt the linear one and report its performance in the experiment section. The adaptive filters are designed for a weighted selection of nodes, therefore a sigmoid nonlinearity is applied to push their values toward binary ones. The parameter matrix $Q$ aligns the output dimension of $f_{\mathrm{adp}}$ with that of the matrix $W_k$. Unlike existing designs of dynamic filters, which generate the weights solely from node or edge features, our adaptive filtering module provides a more thorough consideration by taking both node features and graph connections into account.
3.3 The Framework of HAGCN
In this subsection, we present the framework of HAGCN and demonstrate how it can be applied to various tasks. By adding the adaptive module (6) to the high-order convolution operator (4), we define the HA operator:
$$L_{\mathrm{HA}}^{(k)} = (\tilde{W}_k \circ \tilde{A}^k) X + B_k.$$
[Figure 1: panels (a) and (b)]
Figure 1 gives a visualization of the operators and the HAGCN framework. Figure 1(a) illustrates the HA operator for a single node with $K = 2$: the bottom layer, the adaptive filter, is applied to the weight matrices $W_1$ and $W_2$ to obtain the adaptive weights $\tilde{W}_1$ and $\tilde{W}_2$ (illustrated by the orange and green lines); the second layer brings the adaptive weights and the corresponding adjacency matrices together for convolution. Figure 1(b) emphasizes the fact that the convolution is applied to each node in the graph, in a layer-by-layer manner. It is important to note that the high-order operator and the adaptive filtering module (HA operators) can be used together with other neural network architectures and operations such as fully connected layers, pooling layers, and nonlinear transformations. In this paper, we refer to the graph convolutional network architecture built with our HA operator as HAGCN.
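The HA operator for a single order $k$ can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes the linear adaptive filter has the form $\mathrm{sigmoid}([\tilde{A}^k, X] Q)$ with $Q$ of shape $(n + m) \times n$, so that the filter has the same shape as the $n \times n$ weight matrix, and it takes the clipped matrix $\tilde{A}^k$ as a precomputed input. All function names are hypothetical.

```python
import math


def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def hadamard(P, Q):
    """Elementwise product of two matrices of equal shape."""
    return [[p * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def adaptive_filter_linear(A_k, X, Q):
    """Assumed linear filter: sigmoid([A_k, X] Q), an n-by-n matrix in (0, 1)."""
    concat = [row_a + row_x for row_a, row_x in zip(A_k, X)]  # n x (n + m)
    Z = matmul(concat, Q)                                     # n x n
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in Z]


def ha_operator(A_k, X, W, B, Q):
    """Assumed HA operator: ((g o W) o A_k) X + B with g the adaptive filter."""
    g = adaptive_filter_linear(A_k, X, Q)
    W_tilde = hadamard(g, W)                 # adaptive weights
    out = matmul(hadamard(W_tilde, A_k), X)
    return [[o + b for o, b in zip(ro, rb)] for ro, rb in zip(out, B)]


# Clipped first-order matrix (A + I) of a triangle 0, 1, 2 with node 3 attached to 2
A_1 = [[1, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 1],
       [0, 0, 1, 1]]
X = [[1.0], [2.0], [3.0], [4.0]]            # n = 4 nodes, m = 1 feature
W = [[1.0] * 4 for _ in range(4)]
B = [[0.0] for _ in range(4)]
Q = [[0.0] * 4 for _ in range(5)]           # (n + m) x n; zeros give g = 0.5 everywhere

print(ha_operator(A_1, X, W, B, Q))  # [[3.0], [3.0], [5.0], [3.5]]
```

With $Q$ set to zero the filter is constant at 0.5, so the output is exactly half of the unfiltered first-order convolution; a learned $Q$ would instead produce node-dependent filter values that up-weight or suppress individual neighbors.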
After all layers of convolution, the features from the different orders of convolution are concatenated together:
$$L_{\mathrm{HA}} = [L_{\mathrm{HA}}^{(1)}, \ldots, L_{\mathrm{HA}}^{(K)}].$$
The HAGCN framework takes a feature matrix of shape $n \times m$ ($n$ is the number of nodes in the graph and $m$ is the dimension of the node features) and outputs a matrix of shape $n \times (mK)$, resulting in an increase of the feature dimension by a factor of $K$. Now, we elaborate on how to apply HAGCN to various tasks.
Node-centric prediction: After the graph convolutions in HAGCN, each node is associated with a feature vector. The feature vectors can be used for node-centric classification or regression tasks. This is also closely related to graph (network) representation learning [perozzi2014deepwalk, grover2016node2vec], which refers to the procedure of learning a feature vector for each individual in a complicated system. Under node-centric settings, it means learning a vector for each node in the graph that meaningfully reflects the local graph structure around that node. Our HAGCN also outputs a vector for each node in the graph. In this sense, HAGCN can be viewed as a supervised graph representation learning framework.
Graph-centric prediction: To handle graphs of different sizes, the input adjacency matrix and feature matrix are padded with zeros on the bottom right. Here we point out a subtle difference between node-centric and graph-centric tasks. Under node-centric settings, the dataset is a single graph, with the labels/values of part of the nodes used as the training set and the others as the validation and test sets. Under graph-centric settings, the dataset is a set of graphs (possibly of different sizes), divided into training/validation/test sets. HAGCN works for both cases, and the number of parameters in an HA convolutional layer is $O(n^2)$, with $n$ being the size of the graph (or the maximum size of the graphs). As will be demonstrated later in the experiment section, HAGCN is more prone to overfitting under node-centric settings than under graph-centric settings.
Graph generative modeling: The task of graph generative modeling refers to learning a probabilistic model from a set of graphs $\mathcal{G}$, from which we can sample graphs that have not been seen before but still have structures similar to the graphs in $\mathcal{G}$. With the advent of the variational autoencoder [kingma2013auto] and the adversarial autoencoder [makhzani2015adversarial], graph convolutional networks can be made suitable for generative modeling in addition to discriminative modeling.
An autoencoder consists of two parts: an encoder and a decoder. The encoder maps the input data $x$ to an encoding vector $z$ and the decoder maps from $z$ back to $x$. We call the encoding space the latent or hidden space. To make it a generative model, we usually assume a probability distribution (for example a Gaussian distribution) over the latent space. Here we consider the use of HAGCN as the encoder for graph generative modeling. Given the length of the paper, we skip the technical discussion of the autoencoder model here and defer more details about the HAGCN autoencoder architecture to the experiment section. As an application, graph generative models allow us to create a continuous representation of molecules and generate new chemical structures by searching the latent space, which can be used to guide the process of material design or drug screening.
4 Experiments
4.1 Node-centric learning
First, we considered a node-centric task of supervised document classification in citation graphs. The datasets [Sen:2008wi] comprise three citation graphs, where each graph contains bag-of-words feature vectors for every document and a list of citation links between documents. We treated the citation links as (undirected) edges and constructed a binary and symmetric adjacency matrix $A$. Each document has a class label and the goal is to predict the class label from the document features and the citation graph. The statistics of the datasets are as reported in [Sen:2008wi].
Dataset    Nodes    Edges    Classes  Features
Citeseer   3,327    4,732    6        3,703
Cora       2,708    5,429    7        1,433
Pubmed     19,717   44,338   210      5,414
Training and Architecture: We used the same GCN network structure as Kipf et al. [kipf2016semi], except that their first-order graph convolutional layer is replaced with our HA layer. Here and hereafter, we use gcn_{1,...,k} to denote a graph convolutional layer with orders up to $k$, and fc$h$ to refer to a fully connected layer with $h$ hidden units.
Method          Citeseer  Cora   Pubmed
l1_logistic     0.653     0.701  0.693
l2_logistic     0.672     0.724  0.685
DeepWalk        0.631     0.746  0.712
Planetoid       0.724     0.832  0.844
GCN             0.776     0.889  0.839
gcn_{1,2}       0.788     0.901  0.851
adp_gcn_{1,2}   0.765     0.862  0.840
Here l1_logistic and l2_logistic refer to L1- and L2-regularized logistic regression, DeepWalk refers to the algorithm by Perozzi et al. [perozzi2014deepwalk], Planetoid refers to the algorithm by Yang et al. [Yang:2016ts], and GCN refers to the graph convolutional neural network by Kipf et al. [kipf2016semi]. All the models are implemented with the open-source code on GitHub.
To compare the performance of different models, we randomly divided the dataset into training/validation/test sets at a fixed ratio and reported the prediction accuracy on the test set in Table 1. The tuned hyperparameters are the dropout rate, the L2 regularization weight, and the number of hidden units. From the perspective of node representation learning, the first three models are unsupervised but the last four are (semi-)supervised. This explains why the latter ones have better performance. With our second-order HA graph convolution, the information from 2-hop neighbors can be utilized, resulting in an accuracy gain of roughly one percentage point over GCN on each dataset in Table 1. In contrast, the adaptive module fails to further improve the accuracy. This is because the adaptive filter is designed to generate different filter weights for different graphs, whereas each node-centric task has only one graph, whose convolution weights can be learned directly. Therefore, the adaptive module becomes redundant in this node-centric setting.
4.2 Graph-centric learning
In this experiment, we demonstrated the performance of HAGCN on prediction tasks for molecule graphs. The goal is to predict a molecule’s properties based on its molecule graph. We used the same datasets as described in Duvenaud et al. [duvenaud2015convolutional] and evaluated the following three properties:

Solubility: The aqueous solubility of 1144 molecules, as measured by [delaney2004esol].

Drug efficacy: The half-maximal effective concentration (EC50) in vitro of 10,000 molecules against a sulfide-resistant strain of P. falciparum, the parasite that causes malaria, as measured by [gamo2010thousands].

Organic photovoltaic efficiency: The Harvard Clean Energy Project [hachmann2011harvard] uses expensive DFT simulations to estimate the photovoltaic efficiency of 30,000 organic molecules.
Following the same process as described in Duvenaud et al. [duvenaud2015convolutional], we first used RDKit [landrum2006rdkit] to convert the SMILES [weininger1988smiles] representation of molecules into graphs, which treats hydrogen atoms implicitly. Each node in the graph corresponds to an atom and is associated with an initial feature vector. The features concatenate a one-hot encoding of the atom’s element, its degree, the number of attached hydrogen atoms, the implicit valence, and an aromaticity indicator.
Training and Architecture: The following network architectures are used for comparison. l1_gcn and l2_gcn refer to convolutional networks with one and two graph convolutional layer(s), respectively. To compare the performance of different models, we reported the root mean square errors (RMSEs) in Table 2.
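For reference, the RMSE metric used for this comparison is the square root of the mean squared difference between predictions and targets. A minimal sketch follows; the function name `rmse` and the toy values are illustrative only and are unrelated to the reported results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: both predictions are off by 3, so the RMSE is 3
print(rmse([0.0, 0.0], [3.0, 3.0]))  # 3.0
```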
Name         Architecture
l1_gcn       gcn_{1,2,3}-ReLU-fc64-ReLU-fc16-ReLU-fc1
l1_adp_gcn   adp_gcn_{1,2,3}-ReLU-fc64-ReLU-fc16-ReLU-fc1
l2_gcn       [gcn_{1,2,3}-ReLU]*2-fc64-ReLU-fc16-ReLU-fc1
l2_adp_gcn   [adp_gcn_{1,2,3}-ReLU]*2-fc64-ReLU-fc16-ReLU-fc1
Model        Solubility  Drug efficacy  Photovoltaic efficiency
NFP          0.52        1.16           1.43
MGC          0.46        1.07           1.10
nodeGCN      0.54        1.14           1.45
l1_gcn       0.61        1.20           1.54
l1_adp_gcn   0.50        1.17           1.24
l2_gcn       0.56        1.09           1.35
l2_adp_gcn   0.38        1.07           1.08

The model nodeGCN is in fact a first-order HAGCN without the adaptive filtering module. From the comparison between nodeGCN, l1_gcn and l2_gcn, we can see the effectiveness of our high-order convolution operator. Also, the networks with adaptive modules perform uniformly better than their counterparts without the module, which demonstrates its advantage.
4.3 Graph Generative Modeling
In this experiment, we considered the task of graph generative modeling with HAGCN autoencoder. The network architectures are
where gcn_{} and fc are defined as before, and dconv is defined as .
We implemented HAGCN as the encoder for both the variational autoencoder (VAE) and the adversarial autoencoder (AAE). As stated before, graph generative models can be used to guide molecule synthesis, and the model performance is evaluated based on the proportion of valid molecules among all the newly sampled molecules. We compared our HAGCN generative model with the state-of-the-art RNN model of the Grammar Variational Autoencoder (RNN-GVAE) [Kusner:2017tv]. We closely followed the experimental setup of the graph variational autoencoder by Kipf et al. [Kipf:2016ul], with training data consisting of SMILES molecules [weininger1988smiles] extracted randomly from the ZINC database by Gómez-Bombarelli et al. [GomezBombarelli:2016vk]. Then encodings were drawn from a normal distribution in the latent space and decoded to generate molecules. The HAGCN-AAE model got % valid molecules, and HAGCN-VAE got %, while the RNN-GVAE got %. Here we achieved a significant gain in performance with HAGCN as the encoder for graph generative modeling.
5 Visualization of the HAGCN
5.1 Visualization of the Convolution Weights
We visualized the convolution weights in (4) in the HAGCN trained for the task of photovoltaic efficiency prediction. The convolutional weight matrices are plotted in Figure 2, and the darkness of a block corresponds to the weight value of the corresponding node. We make the following observations from the plots. First, it is easy to see that the weight matrices have symmetrical patterns, which is due to the symmetry of the adjacency matrix $A$ (or $\tilde{A}^k$). Second, as the order of convolution increases, the weights on the central nodes increase as well. Our explanation is that as the order of convolution increases, there are more nodes in the receptive field. The weight increments on the central nodes balance off the effect of having more nodes within the scope of the convolution. Third, for the weights of the higher convolution orders, we observe many off-diagonal blocks having large values (with dark blue color), showing the necessity of introducing the high-order convolution.
5.2 Visualization of Adaptive Filters
The adaptive filters in (6), learned from graph connections and node features, are visualized in Figure 3. The atoms highlighted in red are the randomly selected central nodes for convolution, and the blue circles on atoms indicate the filter weights, with darker blue meaning larger weight. We make the following observations. First, the adaptive filter weights are almost binarized, which means that the filters are capable of selecting nodes for convolution adaptively based on the features and connectivity. Second, for almost all molecules in Figure 3, the atoms being selected are atoms on aromatic rings, which agrees with the chemical intuition that aromatic rings are more important than alkyl chains in terms of predicting organic photovoltaic efficiency. Another interesting observation is that the adaptive filter automatically learned the ortho-para rule in chemistry, which states that for the benzene ring, the functional groups next to (ortho) and opposite to (para) a specific atom have a greater influence on the properties of that atom than the functional groups on other sites. For example, in the molecule in Figure 3, row 2, column 1, the atoms opposite to and next to the central atom are selected over the other atoms on the six-membered ring.
6 Conclusion
In this work, we developed a graph convolutional network architecture, HAGCN, with two new convolution modules specially designed for graph-structured data. With experiments showing the effectiveness of those modules, we advocate considering them in the design of graph convolutional network architectures. As for future work, on one hand, we believe that more work is needed on designing convolutional networks with careful thought given to the underlying (global and local) graph structure. On the other hand, we are currently conducting experiments on automatic chemical design to further demonstrate the practical value of our framework.