Introduction
In recent years, deep learning has made a significant impact on the field of computer vision. Various deep learning models have achieved state-of-the-art results on a number of vision-related benchmarks. In most cases, the preferred architecture is a Convolutional Neural Network (CNN). CNN models have been applied successfully to the tasks of image classification [Krizhevsky, Sutskever, and Hinton 2012], image super-resolution [Kim, Lee, and Lee 2016], and video action recognition [Feichtenhofer, Pinz, and Zisserman 2016], among many others.
CNNs, however, are designed to work on data that can be represented as grids (e.g., videos, images, or audio clips) and do not generalize to graphs, which have a more irregular structure. Due to this limitation, they cannot be applied directly to many real-world problems whose data come in the form of graphs, such as social networks [Perozzi, Al-Rfou, and Skiena 2014] or citation networks [Lu and Getoor 2003] in social network analysis.
A recent deep learning architecture, called Graph Convolutional Networks (GCN) [Kipf and Welling 2017], approximates the spectral convolution operation on graphs by defining a layer-wise propagation that is based on the one-hop neighborhood of nodes. The first-order filters used by GCNs were found to be useful and have allowed the model to beat many established baselines in the semi-supervised node classification task.
However, in many cases, it has been shown that it may be beneficial to consider the higher-order structure in graphs [Yang et al. 2018; Rossi, Ahmed, and Koh 2018; Milo et al. 2002; Rossi, Zhou, and Ahmed 2018b]. In this work, we introduce a general class of graph convolution networks which utilize weighted multi-hop motif adjacency matrices [Rossi, Ahmed, and Koh 2018] to capture higher-order neighborhoods in the graph. The weighted adjacency matrices are computed using various network motifs [Rossi, Ahmed, and Koh 2018]. Fig. 1 shows an example of the node neighborhoods that are induced when we consider two different kinds of motifs, showing that the choice of motif can significantly alter the neighborhood structure of nodes.
Our proposed method, which we call Motif Convolutional Networks (MCN), uses a novel attention mechanism to allow each node to select the most relevant motif-induced neighborhood to integrate information from. Intuitively, this allows a node to select its one-hop neighborhood (as in classical GCN) when its immediate neighborhood contains enough information for the model to classify the node correctly, but gives it the flexibility to select an alternative neighborhood (defined by higher-order structures) when the information in its immediate vicinity is too sparse and/or noisy.
The aforementioned attention mechanism is trained using reinforcement learning, which rewards choices (i.e., actions) that consistently result in a correct classification. Our main contributions can be summarized as follows:
We propose a model that generalizes GCNs by introducing multiple weighted motif-induced adjacencies that capture various higher-order neighborhoods.

We introduce a novel attention mechanism that allows the model to choose the best neighborhood to integrate information from.

We demonstrate the superiority of the proposed method by comparing against strong baselines on three established graph benchmarks. We also show that it works comparatively well on graphs exhibiting heterophily.

We demonstrate the usefulness of attention by showing how different nodes prioritize different neighborhoods.
Related Literature
Neural Networks for Graphs. Initial attempts to adapt neural network models to work with graph-structured data started with recursive models that treated the data as directed acyclic graphs [Sperduti and Starita 1997; Frasconi, Gori, and Sperduti 1998]. Later on, more generalized models called Graph Neural Networks (GNN) were introduced to process arbitrary graph-structured data [Gori, Monfardini, and Scarselli 2005; Scarselli et al. 2009].
Recently, with the rise of deep learning and the success of models such as recurrent neural networks (RNN) [Hausknecht and Stone 2015; Zhou et al. 2016] for sequential data and CNNs for grid-shaped data, there has been a renewed interest in adapting some of these approaches to arbitrary graph-structured data.
Some work introduced architectures tailored for more specific problem domains [Li et al. 2016; Duvenaud et al. 2015], like NeuralFPS [Duvenaud et al. 2015], an end-to-end differentiable deep architecture which generalizes the well-known Weisfeiler-Lehman algorithm for molecular graphs, while others defined graph convolutions based on spectral graph theory [Henaff, Bruna, and LeCun 2015]. Another group of methods attempts to replace principled-yet-expensive spectral graph convolutions with approximations. Defferrard, Bresson, and Vandergheynst (2016) used Chebyshev polynomials to approximate a smooth filter in the spectral domain, while GCNs [Kipf and Welling 2017] further simplified the process by using first-order filters.
The model introduced by Kipf and Welling (2017) has been shown to work well on a variety of graph-based tasks [Nguyen and Grishman 2018; Yan, Xiong, and Lin 2018; Kipf and Welling 2017] and has spawned variants including [Velickovic et al. 2018; Abu-El-Haija et al. 2018]. We introduce a generalization of GCN [Kipf and Welling 2017] in this work, but we differ from past approaches in two main points: first, we use weighted motif-induced adjacencies to expand the possible kinds of node neighborhoods available to nodes, and second, we introduce a novel attention mechanism that allows each node to select the most relevant neighborhood to diffuse (or integrate) information.
Higher-order Structures with Network Motifs. Network motifs [Milo et al. 2002] are fundamental building blocks of complex networks; investigation of such patterns usually leads to the discovery of crucial information about the structure and the function of many complex systems that are represented as graphs. Prill, Iglesias, and Levchenko (2005) studied motifs in biological networks, showing that the dynamical property of robustness to perturbations correlated highly with the appearance of certain motif patterns, while Paranjape, Benson, and Leskovec (2017) looked at motifs in temporal networks, showing that graphs from different domains tend to exhibit very different organizational structures as evidenced by the type of motifs present.
Multiple works have demonstrated that it is useful to account for higher-order structures in different graph-based ML tasks [Rossi, Ahmed, and Koh 2018; Yang et al. 2018; Ahmed et al. 2018]. DeepGL [Rossi, Zhou, and Ahmed 2018a] uses motifs as a basis to learn deep inductive relational functions that represent compositions of relational operators applied to a base graph function such as triangle counts. Rossi, Ahmed, and Koh (2018) proposed the notion of higher-order network embeddings and demonstrated that one can learn better embeddings when various motif-based matrix formulations are considered. Yang et al. (2018) defined a hierarchical motif convolution for the task of subgraph identification for graph classification. In contrast, we propose a new class of higher-order network embedding methods based on graph convolutions that uses a novel motif-based attention for the task of semi-supervised node classification.
Attention Models. Attention was popularized in the deep learning community as a way for models to attend to important parts of the data [Mnih et al. 2014; Bahdanau, Cho, and Bengio 2015]. The technique has been successfully adopted by models solving a variety of tasks. For instance, it was used by Mnih et al. (2014) to take glimpses of relevant parts of an input image for image classification; on the other hand, Xu et al. (2015) used attention to focus on task-relevant parts of an image for the image captioning task. Meanwhile, Bahdanau, Cho, and Bengio (2015) utilized attention for the task of machine translation by fixing the model attention on specific parts of the input when generating the corresponding output words.
There has also been a surge of interest in applying attention to deep learning models for graphs. The work of Velickovic et al. (2018) used a node self-attention mechanism to allow each node to focus on features in its neighborhood that were more relevant, while Lee, Rossi, and Kong (2018) used attention to guide a walk in the graph to learn an embedding for the graph. More specialized graph attention models include [Choi et al. 2017; Han, Liu, and Sun 2018], with Choi et al. (2017) using attention on a medical ontology graph for medical diagnosis and Han, Liu, and Sun (2018) using attention on a knowledge graph for the task of entity link prediction. Our approach differs significantly, however, from previous approaches in that we use attention to allow our model to select task-relevant neighborhoods.
Approach
We begin this section by introducing the foundational layer that is used to construct arbitrarily deep motif convolutional networks. When certain constraints are imposed on our model’s architecture, the model degenerates into a GAT [Velickovic et al. 2018] which, in turn, generalizes a GCN [Kipf and Welling 2017]. Because of this, we briefly introduce a few necessary concepts from [Velickovic et al. 2018; Kipf and Welling 2017] before defining the actual neural architecture we employ, including the reinforcement learning strategy we use to train our attention mechanism.
Graph Self-Attention Layer
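A multi-layer GCN [Kipf and Welling 2017] is constructed using the following layer-wise propagation:
$$\mathbf{H}^{(l+1)} = \sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)}\right) \tag{1}$$
Here, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N$ is the modified adjacency matrix of the input graph with added self-loops, where $\mathbf{A}$ is the original adjacency matrix of the input undirected graph with $N$ nodes and $\mathbf{I}_N$ is an identity matrix of size $N$. The matrix $\tilde{\mathbf{D}}$, on the other hand, is the diagonal degree matrix of $\tilde{\mathbf{A}}$ (i.e., $\tilde{\mathbf{D}}_{i,i} = \sum_j \tilde{\mathbf{A}}_{i,j}$). Finally, $\mathbf{H}^{(l)}$ is the matrix of node features inputted to layer $l$, while $\mathbf{W}^{(l)}$ is a trainable embedding matrix used to embed the given inputs (typically to a lower dimension) and $\sigma$ is a nonlinearity.
The term $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$ in Eq. 1 produces a symmetric normalized matrix which updates each node’s representation via a weighted sum of the features in the node’s one-hop neighborhood (the added self-loop allows the model to include the node’s own features). Each link’s strength (i.e., weight) is normalized by considering the degrees of the corresponding nodes. Formally, at each layer $l$, node $i$ integrates neighboring features to obtain a new feature/embedding via
$$\vec{h}_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}_i^{(\tilde{\mathbf{A}})}} \alpha_{i,j}\, \vec{h}_j^{(l)} \mathbf{W}^{(l)}\right) \tag{2}$$
where $\vec{h}_i^{(l)}$ is the feature of node $i$ at layer $l$ (the $i$-th row of $\mathbf{H}^{(l)}$), with fixed weights $\alpha_{i,j} = 1/\sqrt{\tilde{\mathbf{D}}_{i,i}\tilde{\mathbf{D}}_{j,j}}$, and $\mathcal{N}_i^{(\tilde{\mathbf{A}})}$ is the set of $i$’s neighbors defined by the matrix $\tilde{\mathbf{A}}$, which includes $i$ itself.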
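As a rough illustration of the propagation rule in Eq. 1, the following NumPy sketch applies one layer to a toy graph. The function name `gcn_layer`, the choice of ReLU as the nonlinearity, and the toy inputs are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    deg = A_tilde.sum(axis=1)                 # degrees including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Entry (i, j) becomes A_tilde[i, j] / sqrt(deg[i] * deg[j]).
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)     # ReLU chosen here as the nonlinearity

# Toy path graph 0 - 1 - 2 with 2-d features and identity weights.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
W = np.eye(2)
H_next = gcn_layer(A, H, W)
```

Each row of `H_next` is the degree-normalized weighted sum over the node's own features and those of its one-hop neighbors, which is the per-node view given in Eq. 2.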
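In GAT [Velickovic et al. 2018], Eq. 2 is modified so that the weights $\alpha_{i,j}$ are differentiable (i.e., trainable). This can be viewed as follows:
$$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}\left[\vec{h}_i \mathbf{W} \,\|\, \vec{h}_j \mathbf{W}\right]\right)\right)}{\sum_{k \in \mathcal{N}_i^{(\tilde{\mathbf{A}})}} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}\left[\vec{h}_i \mathbf{W} \,\|\, \vec{h}_k \mathbf{W}\right]\right)\right)} \tag{3}$$
The attention vector $\mathbf{a}$ in Eq. 3 is a trainable weight vector that assigns importance to the different neighbors of $i$, allowing the model to highlight particular neighboring node features that are more task-relevant.
Using the formulation in Eq. 3 with Eqs. 1 and 2, we can now define multiple layers which can be stacked together to form a deep GCN (with self-attention) that is end-to-end differentiable. The initial input to the model can be set as $\mathbf{H}^{(1)} = \mathbf{X}$, where $\mathbf{X} \in \mathbb{R}^{N \times D}$ is the initial node attribute matrix with $D$ attributes. The final layer’s weight matrix can also be set accordingly to output node embeddings at the desired output dimensions.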
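The self-attention weights of Eq. 3 can be sketched as a masked, row-wise softmax over pairwise scores. This is a minimal single-head sketch under stated assumptions: the names `leaky_relu` and `attention_weights`, the LeakyReLU slope of 0.2, and the random toy inputs are illustrative choices, not details given in this text.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # The slope value is an assumption for this sketch.
    return np.where(x > 0, x, slope * x)

def attention_weights(A, H, W, a):
    """Row-normalized attention over each node's neighborhood (self included)."""
    N = A.shape[0]
    Z = H @ W                                   # embedded features, shape (N, F)
    F = Z.shape[1]
    # score[i, j] = LeakyReLU(a . [z_i || z_j]); the dot product with the
    # concatenation splits into one term per half of the vector a.
    scores = leaky_relu((Z @ a[:F])[:, None] + (Z @ a[F:])[None, :])
    mask = (A + np.eye(N)) > 0
    scores = np.where(mask, scores, -np.inf)    # non-neighbors get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=4)                          # length 2 * F
alpha = attention_weights(A, H, W, a)
```

Replacing the fixed weights in Eq. 2 with the rows of `alpha` gives the attention-weighted aggregation; nodes 0 and 2 are not adjacent, so they receive zero weight for each other.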
Convolutional Layer with Motif Attention
We observe that both GCN and GAT rely on the one-hop neighborhood of nodes (i.e., $\tilde{\mathbf{A}}$ in Eq. 1) to propagate information. However, it may not always be suitable to apply a single uniform definition of node neighborhood for all nodes. For instance, Fig. 2 shows an example where a node can benefit from a neighborhood defined using triangle motifs, which keeps only neighbors connected via a stronger bond. This follows a well-known concept from social theory, triadic closure, which lets us distinguish weak ties from strong ones [Friggeri, Chelius, and Fleury 2011].
Weighted Motif-Induced Adjacencies
Given a network $G = (V, E)$ with $N = |V|$ nodes and $M = |E|$ edges, as well as a set of $T$ network motifs $\mathcal{H} = \{H_1, \dots, H_T\}$ (we use the term motifs loosely here; it can also be used to mean graphlets or orbits [Ahmed et al. 2017]), we can construct $T$ different motif-induced adjacency matrices $\mathcal{A} = \{\mathbf{A}_1, \dots, \mathbf{A}_T\}$ with:
$$(\mathbf{A}_t)_{i,j} = \text{number of motifs of type } H_t \text{ that contain } (i, j) \in E.$$
As shown in Fig. 1, neighborhoods defined by different motifs can vary significantly. Furthermore, the weights in a motif-induced adjacency can also vary, as motifs can appear with varying frequency between different pairs of nodes.
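To make the idea of a weighted motif-induced adjacency concrete, here is a small sketch for the triangle motif, where each edge is weighted by the number of triangles it participates in. The function name and the toy graph are assumptions for illustration; other motifs require their own counting routines.

```python
import numpy as np

def triangle_motif_adjacency(A):
    """Entry (i, j) = number of triangles that contain the edge (i, j)."""
    # (A @ A)[i, j] counts common neighbors of i and j; masking with A keeps
    # only pairs that are themselves connected, i.e. closed triangles.
    return (A @ A) * A

# Toy graph: a triangle 0-1-2 plus a pendant edge 2-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
A_tri = triangle_motif_adjacency(A)
```

In this toy graph the edge 2-3 is in no triangle, so node 3 has no neighbors at all under the triangle motif, while it keeps one neighbor under the plain edge motif (for which the motif adjacency is simply `A`).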
Motif Matrix Functions
Each of the calculated motif adjacencies $\mathbf{A}_t \in \mathcal{A}$ can now be potentially used to define motif-induced neighborhoods $\mathcal{N}_i^{(\mathbf{A}_t)}$ with respect to a node $i$. While Eq. 3 defines self-attention weights over a node’s neighborhood, the initial weights in $\mathbf{A}_t$ can still be used as reasonable initial estimates of each neighbor’s “importance.”
Hence, we introduce a motif-based matrix formulation as a function $\Psi: \mathbb{R}^{N \times N} \rightarrow \mathbb{R}^{N \times N}$ over a motif adjacency $\mathbf{A}_t$, similar to [Rossi, Ahmed, and Koh 2018]. Given a function $\Psi$, we can obtain motif-based matrices $\tilde{\mathbf{A}}_t = \Psi(\mathbf{A}_t)$, for $t = 1, \dots, T$. Below, we summarize the different variants of $\Psi$ that we chose to investigate.
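Unweighted Motif Adjacency w/ Self-loops: In the simplest case, we can construct $\tilde{\mathbf{A}}$ (from here on, we omit the subscripts $t$ for brevity) from $\mathbf{A}$ by simply ignoring the weights:
$$\tilde{\mathbf{A}}_{i,j} = \begin{cases} 1 & i = j \\ 1 & \mathbf{A}_{i,j} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$
But, as mentioned above, we lose the initial benefit of leveraging the weights in the motif-induced adjacency $\mathbf{A}$.
Weighted Motif Adjacency w/ Row-wise Max: We can also choose to retain the weighted motif adjacency $\mathbf{A}$ without modification, save for added row-wise maximum self-loops. This is defined as
$$\tilde{\mathbf{A}} = \mathbf{M} + \mathbf{A} \tag{5}$$
where $\mathbf{M}$ is a diagonal square matrix with $\mathbf{M}_{i,i} = \max_{1 \le j \le N} \mathbf{A}_{i,j}$. Intuitively, this allows us to assign an amount of importance to a self-loop equal to that given to each node’s most important neighbor.
Motif Transition w/ Row-wise Max: The random walk on the weighted graph with added row-wise maximum self-loops has transition probabilities $\mathbf{P}_{i,j} = \mathbf{A}_{i,j} / \left(\sum_k \mathbf{A}_{i,k} + \max_{1 \le k \le N} \mathbf{A}_{i,k}\right)$ for $i \neq j$. Our random walk motif transition matrix can thus be calculated by
$$\tilde{\mathbf{A}} = \mathbf{D}^{-1}(\mathbf{A} + \mathbf{M}) \tag{6}$$
where, in this context, the matrix $\mathbf{D}$ is the diagonal square degree matrix of $\mathbf{A} + \mathbf{M}$ (i.e., $\mathbf{D}_{i,i} = \sum_k \mathbf{A}_{i,k} + \max_{1 \le k \le N} \mathbf{A}_{i,k}$) while $\mathbf{M}$ is defined as above. Here, $\tilde{\mathbf{A}}_{i,j}$, or the transition probability from node $i$ to $j$, is proportional to the motif count between nodes $i$ and $j$ relative to the total motif count between $i$ and all its neighbors.
Absolute Motif Laplacian: The absolute Laplacian matrix can be constructed as follows:
$$\tilde{\mathbf{A}} = \mathbf{D} + \mathbf{A} \tag{7}$$
Here, the matrix $\mathbf{D}$ is the degree matrix of $\mathbf{A}$. Note that because the self-loop is a sum of all the weights to a node’s neighbors, the initial importance of the node itself can be disproportionately large.
Symmetric Normalized Matrix w/ Row-wise Max: Finally, we calculate a symmetric normalized matrix (similar to the normalized Laplacian) via:
$$\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}}(\mathbf{A} + \mathbf{M})\mathbf{D}^{-\frac{1}{2}} \tag{8}$$
Here, based on the context, the matrix $\mathbf{D}$ is the diagonal degree matrix of $\mathbf{A} + \mathbf{M}$.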
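The five motif matrix variants described above (Eqs. 4 to 8) can be sketched in a few lines of NumPy. The function names are hypothetical, and the guard that leaves all-zero rows untouched (for nodes that take part in no motif) is an assumption of this sketch rather than something specified in the text.

```python
import numpy as np

def unweighted(A):                 # Eq. 4: drop weights, add self-loops
    return ((A > 0) | np.eye(len(A), dtype=bool)).astype(float)

def weighted_rowmax(A):            # Eq. 5: self-loop equal to the row-wise max
    return A + np.diag(A.max(axis=1))

def transition_rowmax(A):          # Eq. 6: row-normalize the Eq. 5 matrix
    B = weighted_rowmax(A)
    d = B.sum(axis=1, keepdims=True)
    return B / np.where(d > 0, d, 1.0)   # assumption: keep empty rows at zero

def absolute_laplacian(A):         # Eq. 7: degree matrix plus adjacency
    return np.diag(A.sum(axis=1)) + A

def sym_normalized_rowmax(A):      # Eq. 8: symmetric normalization of Eq. 5
    B = weighted_rowmax(A)
    d = B.sum(axis=1)
    d = np.where(d > 0, d, 1.0)          # assumption: keep empty rows at zero
    return B / np.sqrt(np.outer(d, d))

# Toy weighted motif adjacency on three nodes.
A = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])
P = transition_rowmax(A)
```

For node 0 the row-wise maximum is 2, so its self-loop and its strongest neighbor both receive transition probability 0.4 under the Eq. 6 variant, with the remaining 0.2 going to the weaker neighbor.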
K-Step Motif Matrices
Given a step-size $K$, we further define $K$ different $k$-step motif-based matrices for each of the $T$ motifs, which gives a total of $K \times T$ adjacency matrices. Formally, this is formulated as follows:
$$\tilde{\mathbf{A}}_t^{(k)} = \Psi(\mathbf{A}_t^k), \quad \text{for } k = 1, \dots, K \text{ and } t = 1, \dots, T \tag{9}$$
where
$$\mathbf{A}_t^k = \underbrace{\mathbf{A}_t \cdots \mathbf{A}_t}_{k} \tag{10}$$
When we set $K > 1$, we allow nodes to accumulate information from a wider neighborhood. For instance, if we choose to use Eq. 4 (for $\Psi$) and use an edge as our motif, then $\tilde{\mathbf{A}}^{(k)}$ (we omit the motif-type subscript here) captures $k$-hop neighborhoods of each node. While, in theory, using $\tilde{\mathbf{A}}^{(k)}$ is equivalent to a $k$-layer GCN or GAT model, Abu-El-Haija et al. (2018) have shown that GCNs don’t necessarily benefit from the wider receptive field that comes from increasing model depth. This may be for reasons similar to why skip-connections are needed in deep architectures: the signal starts to degrade as the model gets deeper [He et al. 2016].
As another example, we set $\Psi$ to Eq. 6. Now, for an arbitrary motif, we see that $(\tilde{\mathbf{A}}^{(k)})_{i,j}$ encodes the probability of transitioning from node $i$ to node $j$ in $k$ steps.
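The construction in Eqs. 9 and 10 amounts to applying the chosen motif matrix function to successive powers of a motif adjacency. The sketch below assumes the unweighted variant of Eq. 4 and the edge motif; the names `k_step_matrices` and `unweighted` are hypothetical.

```python
import numpy as np

def unweighted(M):
    """Eq. 4 variant: mark nonzero entries and add self-loops."""
    return ((M > 0) | np.eye(len(M), dtype=bool)).astype(float)

def k_step_matrices(A_t, K, psi):
    """Return [psi(A_t^1), ..., psi(A_t^K)] for a single motif adjacency A_t."""
    out, power = [], np.eye(len(A_t))
    for _ in range(K):
        power = power @ A_t          # running product, equal to A_t^k
        out.append(psi(power))
    return out

# Path graph 0 - 1 - 2 - 3 with the edge motif.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
mats = k_step_matrices(A, K=2, psi=unweighted)
```

With this choice, `mats[0]` links node 0 to its direct neighbor 1, while `mats[1]` links node 0 to node 2, which is reachable by a walk of exactly two steps.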
Motif Matrix Selection via Attention
Given $T$ different motifs and a step-size of $K$, we now have $K \times T$ motif matrices we could use with Eq. 1 to define layer-wise propagations. A simple approach would be to implement $K \times T$ independent GCN instances and concatenate the final node outputs before classification. However, this approach may have problems scaling when $T$ and/or $K$ is large.
Instead, we propose to use an attention mechanism, at each layer, to allow each node to select a single most relevant neighborhood to integrate or accumulate information from. For a layer $l$, this can be defined by two functions $f_l: \mathbb{R}^{S_l} \rightarrow \mathbb{R}^{T}$ and $f'_l: \mathbb{R}^{S_l} \rightarrow \mathbb{R}^{K}$, where $S_l$ is the dimension of the state-space for layer $l$. The functions’ outputs are softmaxed to form probability distributions over $\{1, \dots, T\}$ and $\{1, \dots, K\}$, respectively. Essentially, what this means is that given a node $i$’s state, the functions recommend the most relevant motif $t$ and step size $k$ for node $i$ to integrate information from.
Specifically, we define the state matrix encoding node states at layer $l$ as a concatenation of two matrices:
$$\mathbf{S}_l = \left[\, \Psi(\mathbf{A})\, \mathbf{H}^{(l)} \mathbf{W}_l \;\; \mathbf{C} \,\right] \tag{11}$$
where $\mathbf{W}_l$ is the weight matrix that embeds the inputs to dimension $d_l$, $\Psi(\mathbf{A})\, \mathbf{H}^{(l)} \mathbf{W}_l$ is the matrix containing local information obtained by doing a weighted sum of the features in the simple one-hop neighborhood for each node (from the original adjacency $\mathbf{A}$), and $\mathbf{C} \in \mathbb{R}^{N \times T}$ is a motif count matrix that gives us basic local structural information about each node by counting the number of different motifs that each node belongs to, so that $S_l = d_l + T$. We note here that $\mathbf{C}$ is not appended to the node attribute matrix and is not used for prediction. Its only purpose is to capture the local structural information of each node. $\mathbf{C}$ is precomputed once.
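Let us consider an arbitrary layer. Recall that $f_l$ (for brevity, we omit subscripts $l$) produces a probability vector specifying the importance of the various motifs; let $\vec{f}_i = f(\vec{s}_i)$ be the motif probabilities for node $i$, where $\vec{s}_i$ is the state of node $i$. Similarly, let $\vec{f}'_i = f'(\vec{s}_i)$ be the probability vector recommending the step size. Now let $t_i$ be the index of the largest value in $\vec{f}_i$ and, similarly, let $k_i$ be the index of the largest value in $\vec{f}'_i$. In other words, $t_i$ is the recommended motif for $i$ while $k_i$ is the recommended step-size. Attention can now be used to define an $N \times N$ propagation matrix as follows:
$$\hat{\mathbf{A}} = \begin{bmatrix} \left(\tilde{\mathbf{A}}_{t_1}^{(k_1)}\right)_{1,:} \\ \vdots \\ \left(\tilde{\mathbf{A}}_{t_N}^{(k_N)}\right)_{N,:} \end{bmatrix} \tag{12}$$
This layer-specific matrix $\hat{\mathbf{A}}$ can now be plugged into Eq. 1 to replace $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$. This gives each node the flexibility to select the most appropriate motif and step-size to integrate information from.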
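The row-wise selection described above can be sketched as follows: each node takes its own row from the motif matrix indexed by its most probable motif and step size. The function name, the nested-list layout of `motif_mats`, and the constant-valued toy matrices are assumptions for this example.

```python
import numpy as np

def select_propagation_matrix(motif_mats, motif_probs, step_probs):
    """Assemble an N x N propagation matrix row by row.

    motif_mats[t][k] is the N x N matrix for motif t and step k + 1;
    motif_probs has shape (N, T) and step_probs has shape (N, K).
    """
    N = motif_probs.shape[0]
    t_star = motif_probs.argmax(axis=1)    # recommended motif per node
    k_star = step_probs.argmax(axis=1)     # recommended step size per node
    A_hat = np.zeros((N, N))
    for i in range(N):
        A_hat[i] = motif_mats[t_star[i]][k_star[i]][i]
    return A_hat

# Two motifs, one step size; constant matrices make the chosen source visible.
M_edge = np.full((3, 3), 1.0)
M_tri = np.full((3, 3), 2.0)
motif_mats = [[M_edge], [M_tri]]
motif_probs = np.array([[0.9, 0.1],
                        [0.2, 0.8],
                        [0.6, 0.4]])
step_probs = np.ones((3, 1))
A_hat = select_propagation_matrix(motif_mats, motif_probs, step_probs)
```

Here nodes 0 and 2 copy their rows from the first motif's matrix and node 1 copies its row from the second, so different nodes propagate over different neighborhoods within the same layer.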
Training the Attention Mechanism
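Given a labeled graph with $N$ nodes and a labeling function $\ell: V \rightarrow \mathcal{L}$ which maps each node to one of $J$ class labels in $\mathcal{L} = \{1, \dots, J\}$, our goal is to train a classifier that can predict the label of all the nodes. Given a subset $\mathcal{T} \subset V$, or the training set of nodes, we can train an $L$-layer MCN (the classifier) using standard cross-entropy loss as follows:
$$\mathcal{L}_C = -\sum_{v \in \mathcal{T}} \sum_{j=1}^{J} \mathbf{Y}_{v,j} \log \pi\left(\mathbf{H}^{(L+1)}\right)_{v,j} \tag{13}$$
where $\mathbf{Y}_{v,j}$ is a binary value indicating node $v$’s true label (i.e., $\mathbf{Y}_{v,j} = 1$ if $\ell(v) = j$, zero otherwise), and $\pi(\mathbf{H}^{(L+1)})$ is the softmaxed output of the MCN’s last layer.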
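A minimal sketch of the cross-entropy term in Eq. 13, summed over the training nodes only. The names `softmax` and `classification_loss` and the toy inputs are assumptions; because the true-label indicator is one-hot, the double sum reduces to the log-probability of each training node's true class.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # subtract row max for stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def classification_loss(logits, labels, train_idx):
    """Negative log-likelihood of the true class, summed over training nodes."""
    probs = softmax(logits)
    picked = probs[train_idx, labels[train_idx]]
    return -np.log(picked).sum()

# Three nodes, three classes; only nodes 0 and 1 are in the training set.
logits = np.zeros((3, 3))
labels = np.array([0, 2, 1])
train_idx = np.array([0, 1])
loss = classification_loss(logits, labels, train_idx)
```

With all-zero logits every class has probability 1/3, so the loss over the two training nodes is 2 log 3; the unlabeled node 2 does not contribute.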
While Eq. 13 is sufficient for training the MCN to classify inputs, it does not tell us how we can train the attention mechanism that selects the best motif and step-size for each node at each layer. We define a second loss function based on reinforcement learning as follows:
(14) 
Here, the reward given to the system depends on whether a node is classified correctly. The intuition is this: at the last layer, we reward the actions of the classified nodes; we then go to the previous layer (if there is one) and reward the actions of the neighbors of the classified nodes, since their actions affect the outcome; we continue this process until we reach the first layer.
There are a few important things to point out. In practice, we use an $\epsilon$-greedy strategy when selecting a motif and step-size during training. Specifically, we pick the action with the highest probability most of the time but, with probability $\epsilon$, select a random action instead. During testing, we always choose the action with the highest probability. Also, in practice, we use dropout to train the network as in GAT [Velickovic et al. 2018]. Dropout is a good regularization technique and has the added advantage of sampling the neighborhood during training, which keeps the receptive field from growing too large. Finally, to reduce model variance we can also include an advantage term (see Eq. 2 in [Lee, Rossi, and Kong 2018], for instance). Our final loss can then be written as:
(15)
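The exploration strategy used during training, where the most probable action is usually taken but a random one is occasionally chosen, can be sketched as a standard epsilon-greedy rule. The function name, the use of a NumPy random generator, and the example probabilities are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(probs, epsilon, rng):
    """Pick argmax(probs), except with probability epsilon pick uniformly at random."""
    if rng.random() < epsilon:
        return int(rng.integers(len(probs)))   # explore: random action index
    return int(np.argmax(probs))               # exploit: most probable action

rng = np.random.default_rng(0)
motif_probs = np.array([0.1, 0.7, 0.2])
greedy_choice = epsilon_greedy(motif_probs, 0.0, rng)
explore_choice = epsilon_greedy(motif_probs, 1.0, rng)
```

Setting `epsilon` to zero recovers the test-time behavior, where the action with the highest probability is always selected.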
We show a simple (layer) example of the proposed MCN model in Fig. 3. As mentioned, MCN generalizes both GCN and GAT. We list settings of these methods in Tab. 1.
Method  Motif  Adj.      K    Self-attention  Motif-attention
GCN     edge   Eq. 4     1    no              no
GAT     edge   Eq. 4     1    yes             no
MCN*    any    Eqs. 4–8  ≥ 1  yes             yes
Method                                                   Cora (rank)        Citeseer (rank)    Pubmed (rank)      Avg. rank
DeepWalk [Perozzi, Al-Rfou, and Skiena 2014]             67.2% (9)          43.2% (11)         65.3% (11)         10.3
MLP                                                      55.1% (12)         46.5% (9)          71.4% (9)          10.0
LP [Zhu, Ghahramani, and Lafferty 2003]                  68.0% (8)          45.3% (10)         63.0% (12)         10.0
ManiReg [Belkin, Niyogi, and Sindhwani 2006]             59.5% (10)         60.1% (7)          70.7% (10)         9.0
SemiEmb [Weston et al. 2012]                             59.0% (11)         59.6% (8)          71.7% (8)          9.0
ICA [Lu and Getoor 2003]                                 75.1% (7)          69.1% (5)          73.9% (7)          6.3
Planetoid [Yang, Cohen, and Salakhutdinov 2016]          75.7% (6)          64.7% (6)          77.2% (5)          5.7
Chebyshev [Defferrard, Bresson, and Vandergheynst 2016]  81.2% (5)          69.8% (4)          74.4% (6)          5.0
MoNet [Monti et al. 2016]                                81.7% (3)          –                  78.8% (4)          3.5
GCN [Kipf and Welling 2017]                              81.5% (4)          70.3% (3)          79.0% (2)          3.0
GAT [Velickovic et al. 2018]                             83.0 ± 0.7% (2)    72.5 ± 0.7% (2)    79.0 ± 0.3% (2)    2.0
MCN (this paper)                                         83.5 ± 0.4% (1)    73.3 ± 0.7% (1)    79.3 ± 0.3% (1)    1.0
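The average-rank column in the results table can be reproduced by averaging each method's per-dataset ranks over the datasets for which a result is reported. The helper name and the use of `None` for a missing result are assumptions made for this sketch.

```python
def average_rank(ranks):
    """Mean of the available per-dataset ranks, rounded to one decimal place."""
    present = [r for r in ranks if r is not None]
    return round(sum(present) / len(present), 1)

deepwalk = average_rank((9, 11, 11))    # ranks on Cora, Citeseer, Pubmed
monet = average_rank((3, None, 4))      # no Citeseer result is reported
planetoid = average_rank((6, 6, 5))
```

These three calls give 10.3, 3.5, and 5.7, matching the corresponding table entries.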
Experimental Results
Semi-supervised node classification
We first compare our proposed approach against a set of strong baselines (including methods that are considered the current state-of-the-art) on three well-known graph benchmark datasets for semi-supervised node classification. We show that the proposed method is able to achieve state-of-the-art results on all compared datasets.
Datasets
The datasets used for comparison are Cora, Citeseer, and Pubmed. Specifically, we use the preprocessed version made available by Yang, Cohen, and Salakhutdinov (2016). The aforementioned graphs are undirected citation networks where nodes represent documents and edges denote citations; furthermore, a bag-of-words vector capturing word counts in each document serves as each node’s feature. Each document is assigned a class label.
The graph in Cora has , , 7 classes, and node features. The statistics for Citeseer are , , with 6 classes, and . Finally, Pubmed consists of a graph with , , with 3 classes, and . Following previous work, we use only 20 nodes per class for training [Yang, Cohen, and Salakhutdinov 2016; Kipf and Welling 2017; Velickovic et al. 2018]. Again, following the procedure in previous work, we take 1,000 nodes per dataset for testing and further take an additional 500 for validation as in [Kipf and Welling 2017; Velickovic et al. 2018; Abu-El-Haija et al. 2018]. We use the same train/test/validation splits as defined in [Kipf and Welling 2017; Velickovic et al. 2018].
Setup
For Cora and Citeseer, we used the same two-layer model architecture as that of GAT consisting of self-attention heads each with a total of hidden nodes (for a total of 64 hidden nodes) in the first layer, followed by a single softmax layer for classification [Velickovic et al. 2018]. Similarly, we fixed early-stopping patience at and regularization at . For Pubmed, we also used the same architecture as that of GAT (first layer remains the same but the output layer has attention heads to deal with sparsity in the training data). Patience remains the same and, similar to GAT, we use a strong regularization at .
We further optimized all models by testing dropout values of , learning rates of , step-sizes , and motif adjacencies formed using combinations of the following motifs: edge, 2-star, triangle, 3-star, and 4-clique (please refer to Tab. 1(d) for motifs). Self-attention learns to prioritize neighboring features that are more relevant. $\Psi$ (Eqs. 4–8) can be used as a reasonable initial estimate of the importance of neighboring features. For each unique setting of the hyperparameters mentioned previously, we try Eqs. 4–8 and record the best result. Finally, we adopt an $\epsilon$-greedy strategy ().
Comparison
For all three datasets, we report the classification accuracy averaged over runs on random seeds (including standard deviation). Since we utilize the same train/test splits as previous work, we follow [Velickovic et al. 2018; Kipf and Welling 2017] and compile all previously reported results.
A summary of the results is shown in Tab. 2. We see that our proposed method achieves superior performance against all tested baselines on all three benchmarks. On the Cora dataset, the best model used a learning rate of , dropout of , and both the edge and triangle motifs with step-size . For Citeseer, the learning rate was and dropout was still while the only motif used was the edge motif with step-size . However, the second best model for Citeseer, which had comparable performance, utilized the following motifs: edge, 2-star, and triangle. Finally, on Pubmed, the best model used learning rate and dropout of . Once again, the best motifs were the edge and triangle motifs on .
We find that the triangle motif is useful in improving classification performance on the compared datasets. This highlights an advantage of MCN over past approaches (e.g., GCN & GAT) which do not handle triangles (and other motifs) naturally. The results seem to indicate that it can be beneficial to consider stronger bonds (friends that are friends themselves) when selecting a neighborhood.
Comparison on Datasets exhibiting Heterophily
The benchmark datasets (Cora, Citeseer, and Pubmed) that we initially tested our method on exhibited strong homophily, where nodes that share the same labels tend to form densely connected communities. Under these circumstances, methods like GAT or GCN that use a first-order propagation rule will perform reasonably well. However, not all real-world graphs share this characteristic, and in some cases the node labels are more spread out. In this latter case, there is reason to believe that neighborhoods constructed using different motifs, other than just edges and triangles, may be beneficial.
We test this hypothesis by comparing GAT and GCN against MCN on two graphs from the DD dataset [Kersting et al. 2016]. Specifically, we chose two of the largest graphs in the dataset: DD6 and DD7 – with a total of and nodes, respectively. Both graphs had twenty different node labels with the labels being quite imbalanced.
We stick to the semi-supervised training regime, using only nodes per class for training with the rest of the nodes split evenly between testing and validation. This makes the problem highly challenging since the graphs do not exhibit homophily. Since the nodes do not have any attributes, we use the well-known Weisfeiler-Lehman algorithm (we initialize node attributes to a single value and run the algorithm for 3 iterations) to generate node attributes that capture each node’s neighborhood structure.
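The Weisfeiler-Lehman relabeling mentioned above can be sketched as follows: every node starts with the same value, and in each iteration a node's new label is derived from its current label together with the sorted labels of its neighbors. The function name, the adjacency-list input, and the way signatures are compressed to integers are assumptions; how the resulting labels are turned into attribute vectors (e.g., one-hot encoding) is not specified in the text and is left out.

```python
def wl_labels(adj_list, iterations=3):
    """Return the node labels after each Weisfeiler-Lehman refinement round."""
    labels = {v: 0 for v in adj_list}           # every node starts identical
    history = [dict(labels)]
    for _ in range(iterations):
        # A node's signature is its own label plus its neighbors' sorted labels.
        signatures = {
            v: (labels[v], tuple(sorted(labels[u] for u in adj_list[v])))
            for v in adj_list
        }
        # Compress each distinct signature to a small integer id.
        ids = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        labels = {v: ids[signatures[v]] for v in adj_list}
        history.append(dict(labels))
    return history

# Path graph 0 - 1 - 2 - 3.
adj_list = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
final = wl_labels(adj_list, iterations=3)[-1]
```

On this path graph the two endpoints end up sharing one label and the two interior nodes share another, reflecting their different neighborhood structure even though no input attributes were available.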
For the three approaches (GCN, GAT, and MCN), we fix early-stop patience at and use a two-layer architecture with 32 hidden nodes in the first layer followed by the softmax output. We optimized the hyperparameters by searching over learning rate in , regularization in , dropout at . Furthermore, for MCN, we considered combinations of the following motifs {edge, 2-star, triangle, 4-path-edge, 3-star, 4-cycle, 4-clique} and considered steps from . Since there are multiple classes and they are highly imbalanced, we report the Micro-F1 score averaged over 10 runs.
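For reference, Micro-F1 pools true positives, false positives, and false negatives over all classes before computing the F1 score. The sketch below is a plain-Python illustration with a hypothetical function name and toy labels; it is not the evaluation code used in the experiments.

```python
def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all classes."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)   # predicted c and truly c
            fp += (t != c and p == c)   # predicted c but not c
            fn += (t == c and p != c)   # truly c but predicted otherwise
    return 2 * tp / (2 * tp + fp + fn)

y_true = [0, 0, 1, 2]
y_pred = [0, 1, 1, 1]
score = micro_f1(y_true, y_pred)
```

When every node has exactly one true label and one predicted label, each error counts once as a false positive and once as a false negative, so Micro-F1 coincides with accuracy (0.5 in this toy case).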
Method  DD6  DD7
GCN     %    %
GAT     %    %
MCN
A summary of the results is shown in Tab. 3. While the methods do not perform very well (to be expected since we use a very small subset for training on graphs that do not have a high degree of homophily), we do find that, with everything else held constant (model architecture), it is actually valuable to use motifs. For DD6, the best method utilized all motifs except for the 4-path-edge with while in DD7 the best approach only used the edge, triangle, and 4-clique motifs with .
Visualizing Motif Attention
We ran an instance of MCN with the following motifs: edge, 4-path, and triangle with on the Cora dataset. Fig. 4 shows the motif that was selected by the attention mechanism. A few interesting things can be observed here. First, the nodes at the fringe prioritized the 4-path motif, which is quite intuitive since this allows the nodes to aggregate information from a wider (4-hop) neighborhood, which is useful for nodes near the fringe that are more separated from the other nodes in the class. On the other hand, we observe that nodes that chose the triangle motif are almost always found in denser parts of the graph. This shows that it may be beneficial in these cases to consider stronger bonds (e.g., when the neighborhood is noisy). Finally, we see that attention allows different nodes to select different motifs and the system is not “defaulting” to a single motif type.
Conclusion
In this work, we introduce the Motif Convolutional Network which uses a novel attention mechanism to allow different nodes to select a different neighborhood to integrate information from. The method generalizes both GAT and GCN. Experiments on three citation (Cora, Citeseer, & Pubmed) and two bioinformatic (DD6 & DD7) benchmark graphs show the advantage of the proposed approach. We also show experimentally that different nodes do utilize attention to select different neighborhoods, indicating that it may be useful to consider various motifdefined neighborhoods.
References
 [Abu-El-Haija et al. 2018] Abu-El-Haija, S.; Kapoor, A.; Perozzi, B.; and Lee, J. 2018. N-GCN: Multi-scale graph convolution for semi-supervised node classification. In arXiv:1802.08888v1.
 [Ahmed et al. 2017] Ahmed, N. K.; Neville, J.; Rossi, R. A.; Duffield, N.; and Willke, T. L. 2017. Graphlet decomposition: Framework, algorithms, and applications. KAIS 50(3):689–722.
 [Ahmed et al. 2018] Ahmed, N. K.; Rossi, R. A.; Zhou, R.; Lee, J. B.; Kong, X.; Willke, T. L.; and Eldardiry, H. 2018. Learning role-based graph embeddings. In StarAI @ IJCAI, 1–8.
 [Bahdanau, Cho, and Bengio 2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR, 1–15.
 [Belkin, Niyogi, and Sindhwani 2006] Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 7:2399–2434.
 [Choi et al. 2017] Choi, E.; Bahadori, M. T.; Song, L.; Stewart, W. F.; and Sun, J. 2017. GRAM: Graph-based attention model for healthcare representation learning. In KDD, 787–795.
 [Defferrard, Bresson, and Vandergheynst 2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 3837–3845.
 [Duvenaud et al. 2015] Duvenaud, D. K.; Maclaurin, D.; Aguilera-Iparraguirre, J.; Gomez-Bombarelli, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2224–2232.
 [Feichtenhofer, Pinz, and Zisserman 2016] Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR, 1933–1941.
 [Frasconi, Gori, and Sperduti1998] Frasconi, P.; Gori, M.; and Sperduti, A. 1998. A general framework for adaptive processing of data structures. IEEE TNNLS 9(5):768–786.
 [Friggeri, Chelius, and Fleury2011] Friggeri, A.; Chelius, G.; and Fleury, E. 2011. Triangles to capture social cohesion. In SocialCom/PASSAT, 258–265.
 [Gori, Monfardini, and Scarselli2005] Gori, M.; Monfardini, G.; and Scarselli, F. 2005. A new model for learning in graph domains. In IJCNN, 729–734.
 [Han, Liu, and Sun2018] Han, X.; Liu, Z.; and Sun, M. 2018. Neural knowledge acquisition via mutual attention between knowledge graph and text. In AAAI, 1–8.
 [Hausknecht and Stone2015] Hausknecht, M., and Stone, P. 2015. Deep recurrent Qlearning for partially observable MDPs. In AAAI Fall Symposium, 1–9.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
 [Henaff, Bruna, and LeCun2015] Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graphstructured data. In arXiv:1506.05163v1.
 [Kersting et al.2016] Kersting, K.; Kriege, N. M.; Morris, C.; Mutzel, P.; and Neumann, M. 2016. Benchmark data sets for graph kernels.
 [Kim, Lee, and Lee2016] Kim, J.; Lee, J. K.; and Lee, K. M. 2016. Deeply-recursive convolutional network for image super-resolution. In CVPR, 1637–1645.
 [Kipf and Welling2017] Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR, 1–14.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1106–1114.
 [Lee, Rossi, and Kong2018] Lee, J. B.; Rossi, R.; and Kong, X. 2018. Graph classification using structural attention. In KDD, 1666–1674.
 [Li et al.2016] Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2016. Gated graph sequence neural networks. In ICLR, 1–20.
 [Lu and Getoor2003] Lu, Q., and Getoor, L. 2003. Link-based classification. In ICML, 496–503.
 [Milo et al.2002] Milo, R.; Shen-Orr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.; and Alon, U. 2002. Network motifs: Simple building blocks of complex networks. Science 298(5594):824–827.
 [Mnih et al.2014] Mnih, V.; Heess, N.; Graves, A.; and Kavukcuoglu, K. 2014. Recurrent models of visual attention. In NIPS, 2204–2212.
 [Monti et al.2016] Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; and Bronstein, M. M. 2016. Geometric deep learning on graphs and manifolds using mixture model CNNs. In arXiv:1611.08402.
 [Nguyen and Grishman2018] Nguyen, T. H., and Grishman, R. 2018. Graph convolutional networks with argument-aware pooling for event detection. In AAAI, 5900–5907.
 [Paranjape, Benson, and Leskovec2017] Paranjape, A.; Benson, A. R.; and Leskovec, J. 2017. Motifs in temporal networks. In WSDM, 601–610.
 [Perozzi, AlRfou, and Skiena2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In KDD, 701–710.
 [Prill, Iglesias, and Levchenko2005] Prill, R. J.; Iglesias, P. A.; and Levchenko, A. 2005. Dynamic properties of network motifs contribute to biological network organization. PLoS Biology 3(11):1881–1892.
 [Rossi, Ahmed, and Koh2018] Rossi, R. A.; Ahmed, N. K.; and Koh, E. 2018. Higher-order network representation learning. In WWW, 3–4.
 [Rossi, Zhou, and Ahmed2018a] Rossi, R. A.; Zhou, R.; and Ahmed, N. K. 2018a. Deep inductive network representation learning. In BigNet @ WWW, 1–8.
 [Rossi, Zhou, and Ahmed2018b] Rossi, R. A.; Zhou, R.; and Ahmed, N. K. 2018b. Estimation of graphlet counts in massive networks. In TNNLS, 1–14.
 [Scarselli et al.2009] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. IEEE TNNLS 20(1):61–80.
 [Sperduti and Starita1997] Sperduti, A., and Starita, A. 1997. Supervised neural networks for the classification of structures. IEEE TNNLS 8(3):714–735.
 [Velickovic et al.2018] Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph attention networks. In ICLR, 1–12.
 [Weston et al.2012] Weston, J.; Ratle, F.; Mobahi, H.; and Collobert, R. 2012. Deep Learning via Semi-supervised Embedding. Springer. 639–655.
 [Xu et al.2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A. C.; Salakhutdinov, R.; Zemel, R. S.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2048–2057.
 [Yan, Xiong, and Lin2018] Yan, S.; Xiong, Y.; and Lin, D. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 3482–3489.
 [Yang et al.2018] Yang, C.; Liu, M.; Zheng, V. W.; and Han, J. 2018. Node, motif and subgraph: Leveraging network functional blocks through structural convolution. In ASONAM, 1–8.

 [Yang, Cohen, and Salakhutdinov2016] Yang, Z.; Cohen, W. W.; and Salakhutdinov, R. 2016. Revisiting semi-supervised learning with graph embeddings. In ICML, 40–48.
 [Zhou et al.2016] Zhou, G.-B.; Wu, J.; Zhang, C.-L.; and Zhou, Z.-H. 2016. Minimal gated unit for recurrent neural networks. IJAC 13(3):226–234.
 [Zhu, Ghahramani, and Lafferty2003] Zhu, X.; Ghahramani, Z.; and Lafferty, J. D. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 912–919.