1 Introduction
Deep learning on graphs, also known as geometric deep learning (GDL) or graph representation learning (GRL), has emerged in a matter of just a few years from a niche topic into one of the most prominent fields in machine learning. Graph convolutional neural networks (GCNs), which can be traced back to the seminal work of Scarselli et al. (2008), seek to generalize classical convolutional architectures (CNNs) to graph-structured data. A wide variety of convolution-like operations have been developed on graphs, including ChebNet Defferrard et al. (2016), MoNet Monti et al. (2016), GCN Kipf and Welling (2017), SGCN Wu et al. (2019), GAT Velickovic et al. (2018), and GraphSAGE Hamilton et al. (2017b). We refer the reader to recent review papers Bronstein et al. (2017); Hamilton et al. (2017a); Battaglia et al. (2018); Zhang et al. (2018) for a comprehensive overview of deep learning on graphs and its mathematical underpinnings.

Graph deep learning models have been extremely successful in modeling relational data in a variety of domains, including social network link prediction Zhang and Chen (2018), human-object interaction Qi et al. (2018), computer graphics Monti et al. (2016), particle physics Choma et al. (2018), chemistry Duvenaud et al. (2015); Gilmer et al. (2017), medicine Parisot et al. (2018), drug repositioning Zitnik et al. (2018), discovery of anti-cancer foods Veselkov et al. (2019), modeling of proteins Gainza et al. (2019) and nucleic acids Rossi et al. (2019), and fake news detection on social media Monti et al. (2019), to mention a few. Somewhat surprisingly, very simple architectures often perform well in many applications Shchur et al. (2018b). In particular, graph convolutional networks (GCN) Kipf and Welling (2017) and their more recent variant SGCN Wu et al. (2019) apply a shared node-wise linear transformation of the node features, followed by one or more iterations of diffusion on the graph.
Until recently, most of the research in the field has focused on small-scale datasets (CORA Sen et al. (2008), with only a few thousand nodes, still being among the most widely used), and relatively little effort has been devoted to scaling these methods to web-scale graphs such as the Facebook or Twitter social networks. Scaling is indeed a major challenge precluding the wide application of graph deep learning methods in large-scale industrial settings: compared to Euclidean neural networks, where the training loss decomposes into individual samples and can be computed independently, graph convolutional networks diffuse information between nodes along the edges of the graph, making the loss computation interdependent across nodes. Furthermore, in typical graphs the number of nodes in a neighborhood grows exponentially with the filter receptive field, incurring significant computational and memory complexity.
Graph sampling approaches Hamilton et al. (2017b); Ying et al. (2018a); Chen et al. (2018); Huang et al. (2018); Chen and Zhu (2018) attempt to alleviate the cost of training graph neural networks by selecting a small number of neighbors. GraphSAGE Hamilton et al. (2017b) uniformly samples the neighborhood of a given node. PinSAGE Ying et al. (2018a) uses random walks to improve the quality of this approximation. ClusterGCN Chiang et al. (2019) clusters the graph and restricts the diffusion of information to within the computed clusters. GraphSAINT Zeng et al. (2019) uses unbiased estimators of neighborhood graphs: it proposes multiple methods to sample minibatch subgraphs during training, together with normalization techniques to eliminate the sampling bias.
In this paper, we propose a simple, scalable graph neural network architecture generalizing GCN, SGCN, ChebNet, and related methods. Our architecture is analogous to the Inception module Szegedy et al. (2015); Kazi et al. (2019): it combines graph convolutional filters of different sizes that are amenable to efficient precomputation, allowing extremely fast training and inference. Furthermore, our architecture is compatible with various sampling approaches.

We provide extensive experimental validation showing that, despite its simplicity, our approach produces results comparable to state-of-the-art architectures on a variety of large-scale graph datasets, while being significantly (orders of magnitude) faster in training and inference.
2 Background and Related Work
2.1 Deep learning on graphs
Broadly speaking, the goal of graph representation learning is to construct a set of features ('embeddings') representing the structure of the graph and the data living on it. We can distinguish between node-wise embeddings, representing each node of the graph, edge-wise embeddings, representing each edge, and graph-wise embeddings, representing the graph as a whole. In the context of node-wise prediction problems (e.g. node classification tasks), we can distinguish between the following settings. Transductive learning assumes that the entire graph is known, and thus the same graph is used during training and testing (albeit with different nodes used for training and testing). In the inductive setting, training and testing are performed on different graphs. Supervised learning uses a training set of labeled nodes (or graphs, respectively) and tries to predict these labels on a test set. The goal of unsupervised learning is to compute a representation of the nodes (or the graph, respectively) capturing the underlying structure. Typical representatives of this class of architectures are graph autoencoders Kipf and Welling (2016) and random-walk-based embeddings Grover and Leskovec (2016); Perozzi et al. (2014).

A typical graph neural network architecture consists of graph convolution-like operators (discussed in detail in Section 2.3) performing local aggregation of features by means of message passing with neighboring nodes, and possibly pooling, amounting to fixed Dhillon et al. (2007) or learnable Ying et al. (2018b); Bianchi et al. (2019a) graph coarsening. Additionally, graph sampling schemes (detailed in Section 2.4) can be employed on large-scale graphs to reduce the computational complexity.
2.2 Basic notions
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected weighted graph with $n = |\mathcal{V}|$ nodes, represented by the symmetric $n \times n$ adjacency matrix $\mathbf{A}$, where $a_{ij} = a_{ji} > 0$ if $(i,j) \in \mathcal{E}$ and zero otherwise. The diagonal degree matrix $\mathbf{D} = \mathrm{diag}(d_1, \ldots, d_n)$, with $d_i = \sum_j a_{ij}$, represents the (weighted) number of neighbors of each node. We further assume that each node $i$ is endowed with a $d$-dimensional feature vector $\mathbf{x}_i$ and arrange all the node features as rows of the $n \times d$ matrix $\mathbf{X}$.

The normalized graph Laplacian is the $n \times n$ positive semi-definite matrix $\mathbf{\Delta} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$. Applied to the node feature matrix $\mathbf{X}$, the Laplacian amounts to computing, at each node, the difference between the node's feature and a local weighted average of its neighbors' features:

$(\mathbf{\Delta}\mathbf{X})_i = \mathbf{x}_i - \sum_{j \in \mathcal{N}_i} \frac{a_{ij}}{\sqrt{d_i d_j}}\, \mathbf{x}_j,$

where $\mathcal{N}_i$ denotes the neighborhood of node $i$.
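As a quick numerical illustration (the toy graph and weights below are our own, not from the paper), the action of the normalized Laplacian can be checked directly:

```python
import numpy as np

# Toy undirected weighted graph on 3 nodes (invented for illustration).
A = np.array([[0., 1., 0.],
              [1., 0., 2.],
              [0., 2., 0.]])
deg = A.sum(axis=1)                      # weighted degrees
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

# Normalized Laplacian: Delta = I - D^{-1/2} A D^{-1/2}
Delta = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt

# Applied to node features, Delta subtracts from each node's feature a
# degree-normalized weighted average of its neighbors' features.
x = np.array([1.0, 2.0, 3.0])
print(Delta @ x)
```

The matrix is symmetric with non-negative eigenvalues, in agreement with the positive semi-definiteness stated above.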
The Laplacian admits an eigendecomposition of the form $\mathbf{\Delta} = \mathbf{\Phi}\mathbf{\Lambda}\mathbf{\Phi}^\top$, with orthogonal eigenvectors $\mathbf{\Phi} = (\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n)$ and non-negative eigenvalues $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ arranged in increasing order $0 = \lambda_1 \le \cdots \le \lambda_n$. The eigenvectors play the role of a Fourier basis on the graph and the corresponding eigenvalues can be interpreted as frequencies. The graph Fourier transform is given by $\hat{\mathbf{X}} = \mathbf{\Phi}^\top\mathbf{X}$, and one can define the spectral analogy of a convolution on the graph as

$\mathbf{X} \star \mathbf{Y} = \mathbf{\Phi}\left(\mathbf{\Phi}^\top\mathbf{X} \odot \mathbf{\Phi}^\top\mathbf{Y}\right),$

where $\odot$ denotes the element-wise matrix product.
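To make these definitions concrete, here is a small numerical sketch (our own toy example, not from the paper) of the graph Fourier transform and the spectral convolution:

```python
import numpy as np

# Normalized Laplacian of a triangle graph (invented toy example).
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
Delta = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition Delta = Phi diag(lam) Phi^T; Phi is the Fourier basis.
lam, Phi = np.linalg.eigh(Delta)

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0, 1.0])

# Spectral convolution: transform, multiply element-wise, transform back.
conv = Phi @ ((Phi.T @ x) * (Phi.T @ y))
print(conv)
```

Note the cost: this route requires the full eigendecomposition and two dense basis changes, which is exactly the scalability problem discussed next.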
2.3 Convolutionlike operators on graphs
Spectral graph CNNs. Bruna et al. (2014) used the graph Fourier transform to generalize convolutional neural networks (CNNs) LeCun et al. (1989) to graphs. This approach has multiple drawbacks. First, computing the forward and inverse graph Fourier transforms requires multiplication by the dense matrix $\mathbf{\Phi}$, with $\mathcal{O}(n^2)$ complexity, in addition to the precomputation of the eigenvectors $\mathbf{\Phi}$ themselves. Second, the number of filter parameters is $\mathcal{O}(n)$. Third, the filters are not guaranteed to be localized in the node domain. Fourth, the construction explicitly assumes the underlying graph to be undirected, so that the Laplacian is a symmetric matrix with orthogonal eigenvectors. Finally, and most importantly, filters learned on one graph do not generalize to another.
ChebNet. A way to address these issues is to model the filter as a transfer function $\hat{g}(\lambda)$ applied to the Laplacian, $\hat{g}(\mathbf{\Delta}) = \mathbf{\Phi}\,\hat{g}(\mathbf{\Lambda})\,\mathbf{\Phi}^\top$. Unlike the construction of Bruna et al. (2014), which does not generalize across graphs, a filter computed in this manner is stable under graph perturbations Levie et al. (2019). If $\hat{g}$ is a smooth function, the resulting filters are localized in the node domain Henaff et al. (2015). When $\hat{g}$ is expressed in terms of simple matrix-vector operations (e.g. a polynomial Defferrard et al. (2016) or rational function Levie et al. (2018)), the eigendecomposition of the Laplacian can be avoided altogether.
A particularly simple choice is a polynomial spectral filter of degree $r$, $\hat{g}(\lambda) = \sum_{\ell=0}^{r} \alpha_\ell \lambda^\ell$, allowing the convolution to be computed entirely in the spatial domain as

$\mathbf{Y} = \hat{g}(\mathbf{\Delta})\,\mathbf{X} = \sum_{\ell=0}^{r} \alpha_\ell \mathbf{\Delta}^\ell \mathbf{X}. \qquad (1)$

Note that such a filter has only $r+1$ parameters $\alpha_0, \ldots, \alpha_r$, does not require explicit multiplication by $\mathbf{\Phi}$, and has a compact support of $r$ hops in the node domain (since $\mathbf{\Delta}^\ell$ only affects neighbors within $\ell$ hops). Though originating from a spectral construction, the resulting filter is an operation in the node domain amounting to a successive aggregation of features from neighboring nodes. Moreover, using recursively-defined Chebyshev polynomials $T_\ell(\lambda) = 2\lambda T_{\ell-1}(\lambda) - T_{\ell-2}(\lambda)$, with $T_0 = 1$ and $T_1(\lambda) = \lambda$, the computation can be performed with $\mathcal{O}(r)$ sparse matrix products, i.e. with complexity linear in the number of edges for sparsely-connected graphs. Finally, the polynomial filters can be combined with nonlinearities, concatenated in multiple layers, and interleaved with pooling layers based on graph coarsening Defferrard et al. (2016).
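The Chebyshev recursion can be sketched in a few lines of numpy (function and variable names are ours; real implementations use sparse matrices, and the recursion assumes a Laplacian rescaled to have spectrum in $[-1, 1]$):

```python
import numpy as np

def cheb_filter(L, X, alphas):
    """Polynomial filtering via the Chebyshev recursion
    T_l = 2 L T_{l-1} - T_{l-2}: only repeated matrix products with the
    rescaled Laplacian L are needed, never an eigendecomposition."""
    T_prev, T_curr = X, L @ X          # T_0 X and T_1 X
    out = alphas[0] * T_prev
    if len(alphas) > 1:
        out = out + alphas[1] * T_curr
    for a in alphas[2:]:
        T_prev, T_curr = T_curr, 2 * (L @ T_curr) - T_prev
        out = out + a * T_curr
    return out

# Toy usage: the rescaled Laplacian L = Delta - I has spectrum in [-1, 1].
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
Dis = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = (np.eye(3) - Dis @ A @ Dis) - np.eye(3)
X = np.arange(6.0).reshape(3, 2)
Y = cheb_filter(L, X, [0.5, 1.0, -0.25])
```

Each loop iteration costs one sparse matrix product, matching the linear-in-edges complexity stated above.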
GCN. In the case $r = 1$, equation (1) reduces to computing $\alpha_0\mathbf{X} + \alpha_1\mathbf{\Delta}\mathbf{X}$, which can be interpreted as a combination of the node features and their diffused version. Kipf and Welling (2017) proposed the graph convolutional network (GCN) model, combining node-wise and graph diffusion operations:

$\mathbf{Y} = \mathrm{ReLU}(\hat{\mathbf{A}}\mathbf{X}\mathbf{\Theta}), \qquad \hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}. \qquad (2)$

Here $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}}$ is the corresponding degree matrix, and $\mathbf{\Theta}$ is a matrix of learnable parameters.
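A single GCN layer as in equation (2) can be sketched with plain numpy (the random graph and sizes below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, d_out = 5, 4, 3

# Random undirected graph (illustrative only).
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T

A_tilde = A + np.eye(n)                  # add self-loops
D_tilde = A_tilde.sum(axis=1)
# Symmetric normalization: entry (i, j) becomes a_ij / sqrt(d_i d_j).
A_hat = A_tilde / np.sqrt(np.outer(D_tilde, D_tilde))

X = rng.normal(size=(n, d))
Theta = rng.normal(size=(d, d_out))

Y = np.maximum(A_hat @ X @ Theta, 0.0)   # one GCN layer with ReLU
print(Y.shape)
```

Note that the diffusion $\hat{\mathbf{A}}\mathbf{X}$ and the node-wise map $\mathbf{X}\mathbf{\Theta}$ commute in cost terms: the expensive part on large graphs is the diffusion, which motivates the precomputation idea developed later in the paper.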
SGCN. Stacking $r$ GCN layers with element-wise nonlinearity $\sigma$ and a final softmax layer for node classification, it is possible to obtain filters with larger receptive fields on the graph. Wu et al. (2019) argued that a graph convolution with a large filter is practically equivalent to multiple convolutional layers with small filters. They showed that all but the last nonlinearity can be removed without harming performance, resulting in the simplified GCN (SGCN) model,

$\mathbf{Y} = \mathrm{softmax}(\hat{\mathbf{A}}^r \mathbf{X}\mathbf{\Theta}). \qquad (3)$
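The key computational property of equation (3) is that the diffused features contain no learnable parameters. A minimal sketch (names and the toy operator are ours):

```python
import numpy as np

def sgcn_logits(A_hat, X, Theta, r):
    """SGCN (eq. 3), sketched: r diffusion steps with no intermediate
    nonlinearity, followed by a single linear map. Since A_hat^r X is
    parameter-free, it can be computed once before training."""
    Z = X
    for _ in range(r):
        Z = A_hat @ Z
    return Z @ Theta

# Toy usage with invented sizes and a simple row-stochastic operator.
rng = np.random.default_rng(1)
A_hat = np.full((4, 4), 0.25)
X = rng.normal(size=(4, 3))
Theta = rng.normal(size=(3, 2))
logits = sgcn_logits(A_hat, X, Theta, r=2)
```

In practice one would cache `A_hat @ (A_hat @ X)` and train only `Theta`, which is the precomputation trick SIGN builds on.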
2.4 Graph sampling
A characteristic of many graphs, in particular 'small-world' social networks, is the exponential growth of the neighborhood size with the number of hops $k$. In this case, the matrix $\mathbf{A}^k$ becomes dense very quickly even for small values of $k$. For web-scale graphs such as Facebook or Twitter, which have billions of nodes and orders of magnitude more edges, the diffusion matrix cannot be stored in memory for training. In such a scenario, classic graph convolutional models such as GCN, GAT, or MoNet are not applicable. Graph sampling has been shown to be a successful technique for scaling GNNs to large graphs, by approximating the full graph with a sampled version that has a significantly simpler structure amenable to computation. Generally, graph sampling produces a subgraph $\tilde{\mathcal{G}} = (\tilde{\mathcal{V}}, \tilde{\mathcal{E}})$ such that $\tilde{\mathcal{V}} \subset \mathcal{V}$ and $\tilde{\mathcal{E}} \subset \mathcal{E}$. We can distinguish between three types of sampling schemes (Figure 2):
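The densification of the receptive field can be seen even on a small random graph (the graph and numbers below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = (rng.random((n, n)) < 0.03).astype(float)
A = np.triu(A, 1)
A = A + A.T                      # sparse undirected graph, avg degree ~ 6

# Fraction of node pairs reachable within k hops, i.e. the receptive field
# of a k-hop filter; using B = I + A makes the pattern monotone in k.
B = np.eye(n) + A
Bk = np.eye(n)
densities = []
for k in range(1, 4):
    Bk = Bk @ B
    densities.append((Bk > 0).mean())
print(densities)   # grows rapidly with the number of hops
```

On graphs with billions of nodes, this growth is exactly why $\mathbf{A}^k$ (or the aggregation trees it implies) cannot be materialized.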
Node-wise sampling strategies perform graph convolutions on partial node neighborhoods to reduce computational and memory complexity. For each node of the graph, a selection criterion is employed to sample a fixed number of neighbors involved in the convolution operation, and an aggregation tree is constructed out of the extracted nodes.

To overcome memory limitations, node-wise sampling strategies are coupled with minibatch training, where each training step is performed only on a batch of nodes rather than on the whole graph. A training batch is assembled by first choosing 'optimization' nodes (marked in orange in Figure 2, left) and partially expanding their corresponding neighborhoods. In a single training step, the loss is computed and optimized only for the optimization nodes.
Node-wise sampling coupled with minibatch training was first introduced in GraphSAGE Hamilton et al. (2017b) to address the challenges of scaling GNNs. PinSAGE Ying et al. (2018a) extended GraphSAGE with a neighbor selection method using scores from approximations of personalized PageRank via random walks. VRGCN Chen and Zhu (2018) uses control variates to reduce the variance of stochastic training and increase the speed of convergence with a small number of sampled neighbors.
Layer-wise sampling Chen et al. (2018); Huang et al. (2018) avoids the over-expansion of neighborhoods and the redundancy of node-wise sampling. Nodes in each layer only have directed edges towards nodes of the next layer, thus bounding the maximum amount of computation per layer. Moreover, sharing common neighbors prevents feature replication across the batch, drastically reducing the memory complexity of training.
Graph-wise sampling Chiang et al. (2019); Zeng et al. (2019) further advances feature sharing: each batch consists of a connected subgraph, and at each training iteration the GNN model is optimized over all nodes in the subgraph. In ClusterGCN Chiang et al. (2019), non-overlapping clusters are computed as a preprocessing step and then sampled during training as input minibatches. GraphSAINT Zeng et al. (2019) adopts a similar approach, while also correcting for the bias and variance of the minibatch estimators when sampling subgraphs for training. It also explores different schemes to sample the subgraphs, such as a random-walk-based sampler, which is able to co-sample nodes having high influence on each other and guarantees each edge has a non-negligible probability of being sampled. At the time of writing, GraphSAINT is the state-of-the-art method for large graphs.
3 Scalable Inception Graph Neural Networks
In this work we propose SIGN, an alternative method to scale graph neural networks to extremely large graphs. SIGN is not based on sampling nodes or subgraphs as these operations introduce bias into the optimization procedure.
We take inspiration from two recent findings: (i) despite its simplicity, the SGCN model (3) appears to be extremely efficient and to attain results similar to models with multiple stacked convolutional layers Wu et al. (2019); (ii) GCN aggregation schemes (2) have essentially been shown to learn low-pass filters NT and Maehara (2019), while still performing on par with models with more complex aggregation functions on semi-supervised node classification tasks Shchur et al. (2018b).
Accordingly, we propose the following architecture for node-wise classification:
$\mathbf{Z} = \sigma\big(\big[\mathbf{X}\mathbf{\Theta}_0,\ \mathbf{A}\mathbf{X}\mathbf{\Theta}_1,\ \ldots,\ \mathbf{A}^r\mathbf{X}\mathbf{\Theta}_r\big]\big), \qquad \mathbf{Y} = \xi(\mathbf{Z}\mathbf{\Omega}), \qquad (4)$
where $\mathbf{A}$ is a fixed graph diffusion matrix, $\mathbf{\Theta}_0, \ldots, \mathbf{\Theta}_r$ and $\mathbf{\Omega}$ are learnable matrices of dimensions $d \times d'$ and $d'(r+1) \times c$ respectively, for $c$ classes, $[\cdot, \ldots, \cdot]$ denotes concatenation, and $\sigma$, $\xi$ are nonlinearities, the second one computing class probabilities, e.g. via a softmax or sigmoid function, depending on the task at hand. Note that the model in equation (4) is analogous to the popular Inception module Szegedy et al. (2015) for classic CNN architectures (Figure 2): it consists of convolutional filters of different sizes determined by the parameter $r$, where $r = 0$ corresponds to $1 \times 1$ convolutions in the Inception module (amounting to linear transformations of the features at each node, without diffusion across nodes). Owing to this analogy, we refer to our model as the Scalable Inception Graph Network (SIGN).

We note that one prior work extending the idea of an Inception module to GNNs is Kazi et al. (2019); there, however, the authors do not discuss the inclusion of a linear, non-diffusive term (the $r = 0$ branch), which effectively accounts for a skip connection. Additionally, their focus is not on scaling the model to large graphs, but rather on capturing intra- and inter-graph structural heterogeneity.

Generalization of other models. It is easy to observe that various graph convolutional layers can be obtained as particular settings of equation (4). In particular, by setting the nonlinearity $\sigma$ to a PReLU, $\sigma(x) = \max(x, 0) + a \min(x, 0)$, where $a$ is a learnable parameter, ChebNet, GCN, and SGCN can be automatically recovered when a suitable diffusion operator and activation are used (see Table 1).
ChebNet Defferrard et al. (2016)  

GCN Kipf and Welling (2017)  
SGCN Wu et al. (2019)  1 
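A minimal numpy sketch of equation (4) (function names, sizes, and the ReLU/softmax choices are ours; the actual model replaces the single linear maps with feed-forward layers):

```python
import numpy as np

def sign_forward(ops_X, Thetas, Omega):
    """SIGN (eq. 4): each precomputed product A^l X gets its own linear map;
    the results are concatenated, passed through a nonlinearity (sigma = ReLU
    here), and mapped to class probabilities (xi = softmax here)."""
    Z = np.maximum(
        np.concatenate([Xl @ Th for Xl, Th in zip(ops_X, Thetas)], axis=1),
        0.0)
    logits = Z @ Omega
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy usage: n=6 nodes, d=4 features, d'=5 hidden, r=2, c=3 classes.
rng = np.random.default_rng(0)
A = np.full((6, 6), 1.0 / 6.0)           # placeholder diffusion operator
X = rng.normal(size=(6, 4))
ops_X = [X, A @ X, A @ (A @ X)]          # precomputed [X, AX, A^2 X]
Thetas = [rng.normal(size=(4, 5)) for _ in range(3)]
Omega = rng.normal(size=(15, 3))         # d'(r+1) x c
Y = sign_forward(ops_X, Thetas, Omega)
```

Once `ops_X` is built, the forward pass touches no graph structure at all, which is what makes the model trivially minibatchable.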
Efficient computation. Finally and most importantly, we observe that the matrix products $\mathbf{A}\mathbf{X}, \mathbf{A}^2\mathbf{X}, \ldots, \mathbf{A}^r\mathbf{X}$ in equation (4) do not depend on the learnable model parameters and can therefore be precomputed. For large graphs, distributed computing infrastructures such as Apache Spark can speed up this computation. This effectively reduces the computational complexity of the overall model to that of a multi-layer perceptron, whose cost scales with the number of nodes in the training/testing graph, the number of features, and the overall number of feed-forward layers in the model. Table 2 compares the complexity of our SIGN model to that of the other scalable architectures GraphSAGE, ClusterGCN, and GraphSAINT.
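The precomputation step can be sketched as follows (function name is ours, for illustration):

```python
import numpy as np

def precompute_diffused_features(A, X, r):
    """Parameter-free preprocessing: the products A^l X depend only on the
    graph and the input features, so they are built once (possibly on a
    distributed cluster) and reused at every training step."""
    ops = [X]
    for _ in range(r):
        ops.append(A @ ops[-1])
    return np.concatenate(ops, axis=1)     # shape (n, d * (r + 1))

# After this step the graph is no longer needed: any subset of rows forms a
# valid minibatch, and training reduces to fitting an MLP on tabular data.
rng = np.random.default_rng(0)
A = np.full((5, 5), 0.2)
X = rng.normal(size=(5, 3))
feats = precompute_diffused_features(A, X, r=2)
batch = feats[[0, 3]]                      # a minibatch of two nodes
```

This is the crucial contrast with sampling-based methods: they pay the neighborhood-expansion cost at every training step, whereas here it is paid exactly once.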
Preprocessing  Forward Pass  

GraphSAGE Hamilton et al. (2017b)  
ClusterGCN Chiang et al. (2019)  
GraphSAINT Zeng et al. (2019)  
SIGN (Ours) 
In Table 2, the sampling parameter denotes the number of sampled neighbors per node. The forward-pass complexity corresponds to an entire epoch in which all nodes are seen.
4 Experiments
4.1 Datasets
Our method is applicable to both transductive and inductive learning. In the inductive setting, test and validation nodes are held out at training time, i.e. training nodes are only connected to other training nodes. During model evaluation, on the contrary, all the original edges are considered. In the transductive (semisupervised) setting, all the nodes are seen at training time, even though only a small fraction of the nodes have training labels.
Inductive experiments are performed using four large public datasets: Reddit, Flickr, Yelp, and PPI. Introduced in Hamilton et al. (2017b), Reddit is the gold-standard benchmark for GNNs on large-scale graphs. Flickr and Yelp were introduced in Zeng et al. (2019) and PPI in Zitnik and Leskovec (2017). In agreement with Zeng et al. (2019), we confirm that the performance of a variety of models on the last three datasets is unstable, meaning that large variations in the results are observed for very small changes in architecture and optimization parameters. We hypothesize that this is due to errors in the data, or to unfortunate a priori choices of the training, validation, and test splits. We still report results on these datasets for the purpose of comparison with Zeng et al. (2019). Among the considered inductive datasets, Reddit and Flickr are multi-class node-wise classification problems: in the former, the task is to predict communities of online posts based on user comments; in the latter, the task is image categorization based on the description and common properties of online images. Yelp and PPI are multi-label classification problems: the objective of the former is to predict business attributes based on customer reviews, while the latter consists of predicting protein functions from the interactions of human tissue proteins.
While our focus is on large graphs, we also experiment with smaller but well-established transductive datasets to compare SIGN to traditional GNN methods: Amazon-Computers Shchur et al. (2018a), Amazon-Photos Shchur et al. (2018a), and Coauthor-CS Shchur et al. (2018a). These datasets are used in the most recent state-of-the-art evaluation presented in Klicpera et al. (2019). Following Klicpera et al. (2019), we use 20 training nodes per class; 1500 validation nodes are used for Amazon-Photos and Amazon-Computers, and 5000 for Coauthor-CS. Dataset statistics are reported in Tables 4 and 5.
4.2 Setup
For all datasets we use an inception convolutional layer as in equation (4), with PReLU activation He et al. (2015) and a normalized adjacency matrix as diffusion operator. To allow for larger model capacity in the inception module and in computing final model predictions, we replace the single-layer projections performed by the parameters $\mathbf{\Theta}_\ell$ and $\mathbf{\Omega}$ with multiple feed-forward layers. Model outputs for multi-class classification problems are normalized via softmax; for the multi-label classification tasks we use element-wise sigmoid functions. Model parameters are found by minimizing the cross-entropy loss via minibatch gradient descent with the Adam optimizer Kingma and Ba (2014) and early stopping, i.e. we stop training if the validation loss does not decrease for a fixed number of consecutive evaluation phases. To limit overfitting, we apply the standard regularization techniques of weight decay and dropout Srivastava et al. (2014). Additionally, batch normalization Ioffe and Szegedy (2015) is used in every layer to stabilize training and increase convergence speed. Model hyperparameters (weight decay coefficient, dropout rate, hidden layer sizes, batch size, learning rate, number of feed-forward layers in the inception module, number of feed-forward layers for the final classification) are optimized on the validation sets using Bayesian optimization with a Tree Parzen Estimator surrogate function Bergstra et al. (2011). Table 3 shows the hyperparameter ranges defining the search space.

Learning Rate  Batch Size  Dropout  Weight Decay  Hidden Dimensions  Inception FF  Classification FF
Range  [0.00001, 0.01]  [32, 2048]  [0, 1]  [0, 0.01]  [16, 1000]  [1, 2]  [1, 2] 
Dataset  Nodes  Edges  Avg. Degree  Features  Classes  Train / Val / Test

Reddit  232,965  11,606,919  50  602  41(s)  0.66 / 0.10 / 0.24
Yelp  716,847  6,977,410  10  300  100(m)  0.75 / 0.10 / 0.15 
Flickr  89,250  899,756  10  500  7(s)  0.50 / 0.25 / 0.25 
PPI  14,755  225,270  15  50  121(m)  0.66 / 0.12 / 0.22 
Dataset  Nodes  Edges  Avg. Degree  Features  Classes  Label rate

Computers  13,381  245,778  35.76  767  10(s)  0.015 
Photos  7,487  119,043  31.13  745  8(s)  0.021
CoauthorCS  18,333  81,894  8.93  6805  15(s)  0.016 
Preprocessing  Inference  

ClusterGCN  415.29 ± 5.83  9.23 ± 0.10
GraphSAINT  34.29 ± 0.06  3.47 ± 0.03
SIGN (Ours)  234.27 ± 3.79  0.17 ± 0.00
Mean and standard deviation of preprocessing and inference time in seconds on Reddit computed over 10 runs.
Baselines. On the large-scale inductive datasets, we compare our method to GCN Kipf and Welling (2017), FastGCN Chen et al. (2018), StochasticGCN Chen and Zhu (2018), ASGCN Huang et al. (2018), GraphSAGE Hamilton et al. (2017b), ClusterGCN Chiang et al. (2019), and GraphSAINT Zeng et al. (2019), which constitute the current state-of-the-art. To make the comparison fair, all models have 2 graph convolutional layers. The results for the baselines are reported from Zeng et al. (2019). On the smaller transductive datasets, we compare to the well-established methods GCN Kipf and Welling (2017), GAT Velickovic et al. (2018), JK Xu et al. (2018), GIN Xu et al. (2019), ARMA Bianchi et al. (2019b), and the current state-of-the-art DIGL Klicpera et al. (2019).
4.3 Results
Inductive. Table 8 presents the results on the inductive datasets. In line with Zeng et al. (2019), we report means and standard deviations of the micro-averaged F1 score computed over 10 runs. We match the state-of-the-art accuracy on Reddit, while consistently performing competitively on the other datasets.
Runtime. While comparable in terms of accuracy, our method has the advantage of being significantly faster than other methods on large graphs, both in training and inference. In Figure 4, we plot the validation F1 score on Reddit as a function of runtime from the start of training. SIGN converges faster than the other methods, while also reaching a much better F1 score than ClusterGCN. We do not report runtime results for GraphSAGE, as it is substantially slower than the other methods Chiang et al. (2019).
In Table 6, we report the preprocessing time needed to extract the data and prepare it for training, as well as the inference time on the entire Reddit test set. For this experiment, we used the authors' published code. While slightly slower than GraphSAINT in the preprocessing phase, SIGN takes a fraction of the time for inference, outperforming GraphSAINT and GraphSAGE by over two orders of magnitude. It is important to note that while our implementation is in PyTorch, the implementations of GraphSAINT (https://github.com/GraphSAINT/GraphSAINT) and ClusterGCN (https://github.com/google-research/google-research/tree/master/cluster_gcn) are in TensorFlow, which, according to Chiang et al. (2019), is up to six times faster than PyTorch. Moreover, while GraphSAINT's preprocessing is parallelized, ours is not. Aiming at a further speedup, a TensorFlow implementation of our model, together with parallelization of the preprocessing routines, is left as future work.

Transductive. To further validate our method, we compare it to classic, as well as state-of-the-art, (non-scalable) GNN methods on well-established small-scale benchmarks.
Table 7 presents the results on these transductive datasets, averaged over 100 different train/validation/test splits. While our focus is on node-wise classification on large-scale graphs, we show that SIGN is competitive also on smaller, well-established transductive benchmarks, outperforming classical methods and coming close to the current state-of-the-art method (DIGL). This suggests that, while being scalable and fast, and therefore well suited to large-scale applications, it can also effectively tackle problems involving graphs of modest size.
Effect of convolution size r. We perform a sensitivity analysis on the power parameter r, which defines the size of the largest convolutional filter in the inception layer. On Reddit, r = 2 works best, and we keep this configuration on all datasets. Figure 4 depicts the test F1 scores at convergence as a function of r. Interestingly, while the model with r = 3 is a generalization of the model with r = 2, increasing r is actually detrimental in this case. We hypothesize this is due to the features aggregated from the 3-hop neighborhood of a node not being informative, but actually misleading for the model.
AmazonComputer  AmazonPhoto  CoauthorCS  

GCN Kipf and Welling (2017)  84.75 ± 0.23  92.08 ± 0.20  91.83 ± 0.08
GAT Velickovic et al. (2018)  45.37 ± 4.20  53.40 ± 5.49  90.89 ± 0.13
JK Xu et al. (2018)  83.33 ± 0.27  91.07 ± 0.26  91.11 ± 0.09
GIN Xu et al. (2019)  55.44 ± 0.83  68.34 ± 1.16  
ARMA Bianchi et al. (2019b)  84.36 ± 0.26  91.41 ± 0.22  91.32 ± 0.08
DIGL Klicpera et al. (2019)  86.67 ± 0.21  92.93 ± 0.21  92.97 ± 0.07
SIGN (Ours)  85.93 ± 1.21  91.72 ± 1.20  91.98 ± 0.50
Reddit  Flickr  PPI  Yelp  

GCN Kipf and Welling (2017)  0.933 ± 0.000  0.492 ± 0.003  0.515 ± 0.006  0.378 ± 0.001
FastGCN Chen et al. (2018)  0.924 ± 0.001  0.504 ± 0.001  0.513 ± 0.032  0.265 ± 0.053
StochasticGCN Chen and Zhu (2018)  0.964 ± 0.001  0.482 ± 0.003  0.963 ± 0.010  0.640 ± 0.002
ASGCN Huang et al. (2018)  0.958 ± 0.001  0.504 ± 0.002  0.687 ± 0.012  
GraphSage Hamilton et al. (2017b)  0.953 ± 0.001  0.501 ± 0.013  0.637 ± 0.006  0.634 ± 0.006
ClusterGCN Chiang et al. (2019)  0.954 ± 0.001  0.481 ± 0.005  0.875 ± 0.004  0.609 ± 0.005
GraphSAINT Zeng et al. (2019)  0.966 ± 0.001  0.511 ± 0.001  0.981 ± 0.004  0.653 ± 0.003
SIGN (Ours)  0.966 ± 0.003  0.503 ± 0.003  0.965 ± 0.002  0.623 ± 0.005
5 Conclusion and Future Work
Our results are consistent with previous reports Shchur et al. (2018b) advocating in favor of simple architectures (with just a single graph convolutional layer) in graph learning tasks. Our architecture achieves a good trade-off between simplicity, allowing efficient and scalable application to very large graphs, and expressiveness, achieving competitive performance in a variety of applications. For this reason, SIGN is well suited to industrial large-scale systems. Our architecture achieves competitive results on common graph learning benchmarks, while being significantly faster in training and up to two orders of magnitude faster in inference than other scalable approaches.
Extensions. Though in this paper we applied our model to the supervised node-wise classification setting, it is generic and can also be used for graph-wise classification tasks and unsupervised representation learning (e.g. graph autoencoders). The latter is a particularly important setting in recommender systems Berg et al. (2018).
While we focused our discussion on undirected graphs for the sake of simplicity, our model is straightforwardly applicable to directed graphs, in which case the diffusion operator is non-symmetric. Furthermore, more complex aggregation operations, e.g. higher-order Laplacians Barbarossa and Sardellitti (2019) or directed diffusion based on graph motifs Monti et al. (2018), can be straightforwardly incorporated as additional channels in the inception module.
Finally, while our method relies on linear graph aggregation operations of the form $\mathbf{A}^\ell\mathbf{X}$ for efficient precomputation, it is possible to make the diffusion operator dependent on the node features (and edge features, if available), i.e. a matrix of the form $\mathbf{A}(\mathbf{X})$.
Limitations. Graph attention Velickovic et al. (2018) and similar mechanisms Monti et al. (2017) require a more elaborate parametric aggregation operator of the form $\mathbf{A}(\mathbf{X}, \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are learnable parameters. This precludes efficient precomputation, which is key to the efficiency of our approach. Attention can still be implemented in our scheme by first training on a small subset of the graph to determine the attention parameters, then fixing them to precompute the diffusion operator used during training and inference. For the same reason, our scheme efficiently supports only a single graph convolutional layer, though the architecture allows multiple linear layers; architectures with several convolutional layers are achievable by layer-wise training.
References
 Topological signal processing over simplicial complexes. arXiv:1907.11577. Cited by: §5.
 Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261. Cited by: §1.
 Graph convolutional matrix completion. In KDD, Cited by: §5.
 Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems 24, J. ShaweTaylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2546–2554. External Links: Link Cited by: §4.2.
 Mincut pooling in graph neural networks. arXiv:1907.00481. Cited by: §2.1.
 Graph neural networks with convolutional ARMA filters. CoRR abs/1901.01343. External Links: Link, 1901.01343 Cited by: §4.2, Table 7.
 Geometric deep learning: going beyond euclidean data. IEEE Signal Proc. Magazine 34 (4), pp. 18–42. External Links: ISSN 10535888 Cited by: §1.
 Spectral networks and locally connected networks on graphs. In ICLR, Cited by: §2.3, §2.3.
 Stochastic training of graph convolutional networks. Cited by: §1, §2.4, §4.2, Table 8.
 FastGCN: fast learning with graph convolutional networks via importance sampling. Cited by: §1, §2.4, §4.2, Table 8.
 Clustergcn: an efficient algorithm for training deep and large graph convolutional networks. Cited by: §1, §2.4, Table 2, Figure 4, §4.2, §4.3, §4.3, Table 8.
 Graph neural networks for icecube signal classification. In ICMLA, Cited by: §1.
 Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, Cited by: §1, §2.3, §2.3, Table 1.
 Weighted graph cuts without eigenvectors a multilevel approach. PAMI 29 (11), pp. 1944–1957. Cited by: §2.1.
 Convolutional networks on graphs for learning molecular fingerprints. In NIPS, Cited by: §1.
 Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods 17, pp. 184–192. Cited by: §1.
 Neural message passing for quantum chemistry. In ICML, Cited by: §1.
 Node2vec: scalable feature learning for networks. In KDD, Cited by: §2.1.
 Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin. Cited by: §1.
 Inductive representation learning on large graphs. Cited by: §1, §1, §2.4, Table 2, §4.1, §4.2, Table 8.

 Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. arXiv:1502.01852. Cited by: §4.2.
 Deep convolutional networks on graph-structured data. arXiv:1506.05163. Cited by: §2.3.
 Adaptive sampling towards fast graph representation learning. Cited by: §1, §2.4, §4.2, Table 8.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Proceedings of Machine Learning Research, Vol. 37, pp. 448–456. Cited by: §4.2.
 InceptionGCN: receptive field aware graph convolutional network for disease prediction. In IPMI, pp. 73–85. Cited by: §1, §3.
 Adam: a method for stochastic optimization. In ICLR, Cited by: §4.2.
 Variational graph auto-encoders. Cited by: §2.1.
 Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §1, §2.3, Table 1, §4.2, Table 7, Table 8.
 Diffusion improves graph learning. In NeurIPS, Cited by: §4.1, §4.2, Table 7.
 Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (4), pp. 541–551. Cited by: §2.3.
 Transferability of spectral graph convolutional neural networks. arXiv:1907.12972. Cited by: §2.3.
 CayleyNets: graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Proc. 67 (1), pp. 97–109. Cited by: §2.3.
 Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, Cited by: §1, §1, §5.
 Fake news detection on social media using geometric deep learning. arXiv:1902.06673. Cited by: §1.
 MotifNet: a motif-based graph convolutional network for directed graphs. In DSW, Cited by: §5.
 Revisiting graph neural networks: all we have is low-pass filters. arXiv:1905.09550. Cited by: §3.
 Disease prediction using graph convolutional networks: application to autism spectrum disorder and alzheimer’s disease. Med Image Anal 48, pp. 117–130. Cited by: §1.
 PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §4.2.
 DeepWalk: online learning of social representations. In KDD, Cited by: §2.1.
 Learning human-object interactions by graph parsing neural networks. In ECCV, pp. 401–417. Cited by: §1.
 ncRNA classification with graph convolutional networks. In KDD Workshop on Deep Learning on Graphs, Cited by: §1.
 The graph neural network model. Trans. Neural Networks 20 (1), pp. 61–80. Cited by: §1.
 Collective classification in network data. AI Magazine 29 (3), pp. 93–106. Cited by: §1.
 Pitfalls of graph neural network evaluation. Relational Representation Learning Workshop; arXiv:1811.05868. Cited by: §1, §3, §4.1, §5.
 Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. Cited by: §4.2.
 Going deeper with convolutions. In CVPR, Cited by: §1, §3.
 Graph attention networks. In ICLR, Cited by: §1, §4.2, Table 7.
 HyperFoods: machine intelligent mapping of cancer-beating molecules in foods. Scientific Reports 9 (1), pp. 1–12. Cited by: §1, §5.
 Simplifying graph convolutional networks. Cited by: §1, §1, §2.3, Table 1, §3.
 How powerful are graph neural networks? Cited by: §4.2, Table 7.
 Representation learning on graphs with jumping knowledge networks. In ICML, Cited by: §4.2, Table 7.
 Graph convolutional neural networks for webscale recommender systems. Cited by: §1, §2.4.
 Hierarchical graph representation learning with differentiable pooling. In NeurIPS, Cited by: §2.1.
 GraphSAINT: graph sampling based inductive learning method. arXiv:1907.04931. Cited by: §2.4, Table 2, §4.1, §4.2, §4.3, Table 8.
 Link prediction based on graph neural networks. In NIPS, Cited by: §1.
 Deep learning on graphs: a survey. arXiv:1812.04202. Cited by: §1.
 Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34 (13), pp. i457–i466. Cited by: §1.
 Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33 (14), pp. i190–i198. Cited by: §4.1.