SIGN: Scalable Inception Graph Neural Networks

April 23, 2020 · Emanuele Rossi et al. · Twitter

Geometric deep learning, a novel class of machine learning algorithms extending classical deep learning architectures to non-Euclidean structured data such as manifolds and graphs, has recently been applied to a broad spectrum of problems ranging from computer graphics and chemistry to high energy physics and social media. In this paper, we propose SIGN, a scalable graph neural network analogous to the popular inception module used in classical convolutional architectures. We show that our architecture is able to effectively deal with large-scale graphs via pre-computed multi-scale neighborhood features. Extensive experimental evaluation on various open benchmarks shows the competitive performance of our approach compared to a variety of popular architectures, while requiring a fraction of training and inference time.


1 Introduction

Deep learning on graphs, also known as geometric deep learning (GDL) or graph representation learning (GRL), has emerged in a matter of just a few years from a niche topic to one of the most prominent fields in machine learning. Graph convolutional neural networks (GCNs), which can be traced back to the seminal work of Scarselli et al. (2008), seek to generalize classical convolutional architectures (CNNs) to graph-structured data. A wide variety of convolution-like operations have been developed on graphs, including ChebNet Defferrard et al. (2016), MoNet Monti et al. (2016), GCN Kipf and Welling (2017), S-GCN Wu et al. (2019), GAT Velickovic et al. (2018), and GraphSAGE Hamilton et al. (2017b). We refer the reader to recent review papers Bronstein et al. (2017); Hamilton et al. (2017a); Battaglia and others (2018); Zhang et al. (2018) for a comprehensive overview of deep learning on graphs and its mathematical underpinnings.

Graph deep learning models have been extremely successful in modeling relational data in a variety of different domains, including social network link prediction Zhang and Chen (2018), human-object interaction Qi et al. (2018), computer graphics Monti et al. (2016), particle physics Choma et al. (2018), chemistry Duvenaud et al. (2015); Gilmer et al. (2017), medicine Parisot et al. (2018), drug repositioning Zitnik et al. (2018), discovery of anti-cancer foods Veselkov and others (2019), modeling of proteins Gainza and others (2019) and nucleic acids Rossi et al. (2019), and fake news detection on social media Monti et al. (2019), to mention a few. Somewhat surprisingly, very simple architectures often perform well in many applications Shchur et al. (2018b). In particular, graph convolutional networks (GCN) Kipf and Welling (2017) and their more recent variant S-GCN Wu et al. (2019) apply a shared node-wise linear transformation of the node features, followed by one or more iterations of diffusion on the graph.

Until recently, most of the research in the field has focused on small-scale datasets (CORA Sen et al. (2008), with only a few thousand nodes, still being among the most widely used), and relatively little effort has been devoted to scaling these methods to web-scale graphs such as the Facebook or Twitter social networks. Scaling is indeed a major challenge precluding the wide application of graph deep learning methods in large-scale industrial settings: compared to Euclidean neural networks, where the training loss decomposes into individual samples that can be computed independently, graph convolutional networks diffuse information between nodes along the edges of the graph, making the loss computation interdependent across nodes. Furthermore, in typical graphs the number of nodes in the receptive field grows exponentially with the filter size, incurring significant computational and memory complexity.

Graph sampling approaches Hamilton et al. (2017b); Ying et al. (2018a); Chen et al. (2018); Huang et al. (2018); Chen and Zhu (2018) attempt to alleviate the cost of training graph neural networks by selecting a small number of neighbors. GraphSAGE Hamilton et al. (2017b) uniformly samples the neighborhood of a given node. PinSAGE Ying et al. (2018a) uses random walks to improve the quality of such approximation. ClusterGCN Chiang et al. (2019) clusters the graph and enforces diffusion of information only within the computed clusters. GraphSAINT Zeng et al. (2019) builds unbiased estimators over sampled neighborhood subgraphs: it proposes multiple methods to sample minibatch subgraphs during training, together with normalization techniques that eliminate the resulting bias.

In this paper, we propose a simple scalable graph neural network architecture generalizing GCN, S-GCN, ChebNet and related methods. Our architecture is analogous to the inception module Szegedy et al. (2015); Kazi et al. (2019) and combines graph convolutional filters of different size that are amenable to efficient precomputation, allowing extremely fast training and inference. Furthermore, our architecture is compatible with various sampling approaches.

We provide extensive experimental validation showing that, despite its simplicity, our approach produces comparable results to state-of-the-art architectures on a variety of large-scale graph datasets while being significantly faster (orders of magnitude) in training and inference.

2 Background and Related Work

2.1 Deep learning on graphs

Broadly speaking, the goal of graph representation learning is to construct a set of features (‘embeddings’) representing the structure of the graph and the data thereon. We can distinguish among Node-wise embeddings, representing each node of the graph, Edge-wise embeddings, representing each edge in the graph, and Graph-wise embeddings, representing the graph as a whole. In the context of node-wise prediction problems (e.g. node classification), we can further distinguish between the following settings. Transductive learning assumes that the entire graph is known, and thus the same graph is used during training and testing (albeit with different nodes used for training and testing). In the Inductive setting, training and testing are performed on different graphs. Supervised learning uses a training set of labeled nodes (or graphs, respectively) and tries to predict these labels on a test set. The goal of Unsupervised learning is to compute a representation of the nodes (or the graph, respectively) capturing the underlying structure; typical representatives of this class of architectures are graph autoencoders Kipf and Welling (2016) and random walk-based embeddings Grover and Leskovec (2016); Perozzi et al. (2014).

A typical graph neural network architecture consists of graph Convolution-like operators (discussed in detail in Section 2.3), performing local aggregation of features by means of message passing with neighboring nodes, and possibly Pooling, amounting to fixed Dhillon et al. (2007) or learnable Ying et al. (2018b); Bianchi et al. (2019a) graph coarsening. Additionally, graph Sampling schemes (detailed in Section 2.4) can be employed on large-scale graphs to reduce the computational complexity.

2.2 Basic notions

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected weighted graph with $n = |\mathcal{V}|$ nodes, represented by the symmetric $n \times n$ adjacency matrix $\mathbf{W}$, where $w_{ij} > 0$ if $(i,j) \in \mathcal{E}$ and zero otherwise. The diagonal degree matrix $\mathbf{D} = \mathrm{diag}(d_1, \dots, d_n)$, with $d_i = \sum_j w_{ij}$, represents the number of neighbors of each node. We further assume that each node $i$ is endowed with a $d$-dimensional feature vector $\mathbf{x}_i$, and arrange all the node features as rows of the $n \times d$-dimensional matrix $\mathbf{X}$.

The normalized graph Laplacian is the $n \times n$ positive semi-definite matrix $\boldsymbol{\Delta} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$. Applied to the $n \times d$ node feature matrix $\mathbf{X}$, the Laplacian amounts to computing the difference between the feature at each node and the local weighted average of its neighbors,

$$(\boldsymbol{\Delta} \mathbf{X})_i = \mathbf{x}_i - \sum_{j \in \mathcal{N}_i} \frac{w_{ij}}{\sqrt{d_i d_j}}\, \mathbf{x}_j,$$

where $\mathcal{N}_i = \{ j : (i,j) \in \mathcal{E} \}$ is the neighborhood of node $i$.

The Laplacian admits an eigendecomposition of the form $\boldsymbol{\Delta} = \boldsymbol{\Phi} \boldsymbol{\Lambda} \boldsymbol{\Phi}^\top$ with orthogonal eigenvectors $\boldsymbol{\Phi} = (\boldsymbol{\phi}_1, \dots, \boldsymbol{\phi}_n)$ and non-negative eigenvalues $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ arranged in increasing order $0 = \lambda_1 \le \dots \le \lambda_n$. The eigenvectors play the role of a Fourier basis on the graph and the corresponding eigenvalues can be interpreted as frequencies. The graph Fourier transform is given by $\hat{\mathbf{X}} = \boldsymbol{\Phi}^\top \mathbf{X}$, and one can define the spectral analogue of a convolution on the graph as

$$\mathbf{X} \star \mathbf{Y} = \boldsymbol{\Phi} \big( (\boldsymbol{\Phi}^\top \mathbf{X}) \odot (\boldsymbol{\Phi}^\top \mathbf{Y}) \big),$$

where $\odot$ denotes the element-wise matrix product.
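To make these notions concrete, the following NumPy sketch (our illustration, not code from the paper) builds the normalized Laplacian of a toy graph, computes the Fourier basis via an eigendecomposition, and applies a spectral low-pass filter through the graph Fourier transform.

```python
import numpy as np

# Toy undirected graph: 4 nodes on a path, unit edge weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

# Normalized Laplacian: Delta = I - D^{-1/2} W D^{-1/2}
Delta = np.eye(4) - D_inv_sqrt @ W @ D_inv_sqrt

# Eigendecomposition: columns of Phi form a Fourier basis, lam are graph frequencies.
lam, Phi = np.linalg.eigh(Delta)

# Graph Fourier transform of node features X (4 nodes, 2 features per node).
X = np.random.randn(4, 2)
X_hat = Phi.T @ X                      # forward transform
X_rec = Phi @ X_hat                    # inverse transform recovers X

# A spectral low-pass filter: attenuate high graph frequencies.
g = np.exp(-2.0 * lam)                 # filter response g(lambda)
Y = Phi @ (g[:, None] * (Phi.T @ X))   # filtered signal g(Delta) X

assert np.allclose(X, X_rec)
```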

2.3 Convolution-like operators on graphs

Spectral graph CNNs.     Bruna et al. (2014) used the graph Fourier transform to generalize convolutional neural networks (CNN) LeCun et al. (1989) to graphs. This approach has multiple drawbacks. First, the computation of the graph Fourier transform has $\mathcal{O}(n^2)$ complexity, in addition to the precomputation of the eigenvectors $\boldsymbol{\Phi}$. Second, the number of filter parameters is $\mathcal{O}(n)$. Third, the filters are not guaranteed to be localized in the node domain. Fourth, the construction explicitly assumes the underlying graph to be undirected, in order for the Laplacian to be a symmetric matrix with orthogonal eigenvectors. Finally, and most importantly, filters learned on one graph do not generalize to another.

ChebNet.     A way to address these issues is to model the filter as a transfer function $\hat{g}(\lambda)$, applied to the Laplacian as $\hat{g}(\boldsymbol{\Delta}) = \boldsymbol{\Phi}\, \hat{g}(\boldsymbol{\Lambda})\, \boldsymbol{\Phi}^\top$. Unlike the construction of Bruna et al. (2014), which does not generalize across graphs, a filter computed in this manner is stable under graph perturbations Levie et al. (2019). If $\hat{g}$ is a smooth function, the resulting filters are localized in the node domain Henaff et al. (2015). In the case when $\hat{g}$ is expressed via simple matrix-vector operations (e.g. a polynomial Defferrard et al. (2016) or rational function Levie et al. (2018)), the eigendecomposition of the Laplacian can be avoided altogether.

A particularly simple choice is a polynomial spectral filter of degree $r$, $\hat{g}(\lambda) = \sum_{k=0}^{r} \theta_k \lambda^k$, allowing the convolution to be computed entirely in the spatial domain as

$$\mathbf{Y} = \hat{g}(\boldsymbol{\Delta})\, \mathbf{X} = \sum_{k=0}^{r} \theta_k \boldsymbol{\Delta}^k \mathbf{X}. \tag{1}$$

Note that such a filter has $r+1$ parameters $\theta_0, \dots, \theta_r$, does not require explicit multiplication by $\boldsymbol{\Phi}$, and has a compact support of $r$ hops in the node domain (due to the fact that $\boldsymbol{\Delta}^k$ affects only neighbors within $k$ hops). Though originating from a spectral construction, the resulting filter is an operation in the node domain amounting to a successive aggregation of features in the neighbor nodes. Moreover, using recursively-defined Chebyshev polynomials $T_k(\lambda) = 2\lambda T_{k-1}(\lambda) - T_{k-2}(\lambda)$ with $T_0 = 1$ and $T_1 = \lambda$ as the polynomial basis, the computation can be performed with $\mathcal{O}(r|\mathcal{E}|)$ complexity for sparsely-connected graphs. Finally, the polynomial filters can be combined with non-linearities, concatenated in multiple layers, and interleaved with pooling layers based on graph coarsening Defferrard et al. (2016).
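As a sketch of how a degree-$r$ polynomial filter such as (1) can be applied without any eigendecomposition (a minimal illustration in the notation above, not the ChebNet reference implementation), only repeated sparse matrix products with the operator are needed:

```python
import numpy as np
import scipy.sparse as sp

def polynomial_filter(L, X, theta):
    """Apply g(L) X = sum_k theta[k] * L^k X using only sparse matrix products."""
    out = theta[0] * X
    Lk_X = X
    for k in range(1, len(theta)):
        Lk_X = L @ Lk_X          # one more hop of diffusion: L^k X
        out = out + theta[k] * Lk_X
    return out

# Toy example with a random sparse operator standing in for the Laplacian.
n, d, r = 100, 8, 3
A = sp.random(n, n, density=0.05, format="csr")   # placeholder sparse matrix
L = sp.eye(n) - A                                  # illustrative operator only
X = np.random.randn(n, d)
theta = np.random.randn(r + 1)                     # r+1 filter coefficients
Y = polynomial_filter(L, X, theta)
```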

GCN.     In the case $r = 1$, equation (1) reduces to computing $\theta_0 \mathbf{X} + \theta_1 \boldsymbol{\Delta} \mathbf{X}$, which can be interpreted as a combination of the node features and their diffused version. Kipf and Welling (2017) proposed a model of graph convolutional networks (GCN) combining node-wise and graph diffusion operations:

$$\mathbf{Y} = \mathrm{ReLU}(\mathbf{A} \mathbf{X} \boldsymbol{\Theta}), \qquad \mathbf{A} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{W}} \tilde{\mathbf{D}}^{-1/2}. \tag{2}$$

Here $\tilde{\mathbf{W}} = \mathbf{W} + \mathbf{I}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}}$ is the respective degree matrix, and $\boldsymbol{\Theta}$ is a matrix of learnable parameters.
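For concreteness, a single GCN propagation step as in (2) can be sketched as follows (a minimal dense PyTorch illustration, not the reference implementation, which uses sparse operations):

```python
import torch

def gcn_layer(W, X, Theta):
    """One GCN layer: Y = ReLU(D~^{-1/2} (W + I) D~^{-1/2} X Theta)."""
    n = W.shape[0]
    W_tilde = W + torch.eye(n)                 # add self-loops
    d_tilde = W_tilde.sum(dim=1)               # degrees including self-loops
    D_inv_sqrt = torch.diag(d_tilde.pow(-0.5))
    A = D_inv_sqrt @ W_tilde @ D_inv_sqrt      # normalized diffusion operator
    return torch.relu(A @ X @ Theta)

# Toy usage: 5 nodes, 3 input features, 4 output features.
W = (torch.rand(5, 5) > 0.5).float()
W = torch.triu(W, 1)
W = W + W.T                                    # symmetric adjacency, no self-loops
X = torch.randn(5, 3)
Theta = torch.randn(3, 4)
Y = gcn_layer(W, X, Theta)
```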

S-GCN.     Stacking $r$ GCN layers with element-wise non-linearity $\sigma$ and a final softmax layer for node classification, it is possible to obtain filters with larger receptive fields on the graph nodes,

$$\mathbf{Y} = \mathrm{softmax}\big(\mathbf{A}\, \sigma(\cdots \mathbf{A}\, \sigma(\mathbf{A} \mathbf{X} \boldsymbol{\Theta}_1) \cdots)\, \boldsymbol{\Theta}_r\big).$$

Wu et al. (2019) argued that a single graph convolution with a large filter is practically equivalent to multiple convolutional layers with small filters. They showed that all but the last non-linearity can be removed without harming performance, resulting in the simplified GCN (S-GCN) model,

$$\mathbf{Y} = \mathrm{softmax}(\mathbf{A}^{r} \mathbf{X} \boldsymbol{\Theta}). \tag{3}$$

2.4 Graph sampling

A characteristic of many graphs, in particular ‘small-world’ social networks, is the exponential growth of the neighborhood size with the number of hops $k$. In this case, the matrix $\mathbf{A}^k$ becomes dense very quickly, even for small values of $k$. For web-scale graphs such as Facebook or Twitter, which can have billions of nodes and hundreds of billions of edges, the diffusion matrix cannot be stored in memory for training. In such a scenario, classic graph convolutional models such as GCN, GAT or MoNet are not applicable. Graph sampling has been shown to be a successful technique to scale GNNs to large graphs, by approximating the diffusion matrix with one that has a significantly simpler structure amenable to computation. Generally, graph sampling produces a subgraph $\tilde{\mathcal{G}} = (\tilde{\mathcal{V}}, \tilde{\mathcal{E}})$ such that $\tilde{\mathcal{V}} \subseteq \mathcal{V}$ and $\tilde{\mathcal{E}} \subseteq \mathcal{E}$. We can distinguish between three types of sampling schemes (Figure 2):

Node-wise sampling strategies perform graph convolutions on partial node neighborhoods to reduce computational and memory complexity. For each node of the graph, a selection criterion is employed to sample a fixed number of neighbors involved in the convolution operation and an aggregation tree is constructed out of the extracted nodes.

To overcome memory limitations, node-wise sampling strategies are coupled with minibatch training, where each training step is performed only on a batch of nodes rather than on the whole graph. A training batch is assembled by first choosing ‘optimization’ nodes (marked in orange in Figure 2, left), and partially expanding their corresponding neighborhoods. In a single training step, the loss is computed and optimized only for optimization nodes.

Node-wise sampling coupled with minibatch training was first introduced in GraphSAGE Hamilton et al. (2017b) to address the challenges of scaling GNNs. PinSAGE Ying et al. (2018a) extended GraphSAGE by exploiting a neighbor selection method using scores from approximations of personalized PageRank via random walks. VR-GCN Chen and Zhu (2018) uses control variates to reduce the variance of stochastic training and increase the speed of convergence with a small number of neighbors.
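As an illustration of this procedure (a simplified sketch with hypothetical helper names, not the GraphSAGE code), the following snippet expands a batch of optimization nodes by sampling a fixed number of neighbors per hop:

```python
import random

def sample_neighborhood(adj_list, batch_nodes, fanouts):
    """Sample a fixed number of neighbors per node for each hop.

    adj_list: dict mapping node -> list of neighbors
    batch_nodes: the 'optimization' nodes where the loss is computed
    fanouts: e.g. [2, 2] samples 2 neighbors at hop 1 and 2 at hop 2
    Returns one list of sampled (node, neighbor) edges per hop.
    """
    layers = []
    frontier = list(batch_nodes)
    for fanout in fanouts:
        sampled_edges = []
        next_frontier = set()
        for v in frontier:
            neighbors = adj_list.get(v, [])
            chosen = random.sample(neighbors, min(fanout, len(neighbors)))
            sampled_edges.extend((v, u) for u in chosen)
            next_frontier.update(chosen)
        layers.append(sampled_edges)
        frontier = list(next_frontier)
    return layers

# Toy usage on a tiny graph.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(sample_neighborhood(adj, batch_nodes=[0, 1], fanouts=[2, 2]))
```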

Layer-wise sampling Chen et al. (2018); Huang et al. (2018) avoids the over-expansion of neighborhoods and the ensuing redundancy of node-wise sampling. Nodes in each layer only have directed edges towards nodes of the next layer, thus bounding the maximum amount of computation to a fixed budget per layer. Moreover, sharing common neighbors prevents feature replication across the batch, drastically reducing the memory complexity during training.

Graph-wise sampling Chiang et al. (2019); Zeng et al. (2019) further advances feature sharing: each batch consists of a connected subgraph, and at each training iteration the GNN model is optimized over all nodes in the subgraph. In ClusterGCN Chiang et al. (2019), non-overlapping clusters are computed as a pre-processing step and then sampled during training as input minibatches. GraphSAINT Zeng et al. (2019) adopts a similar approach, while also correcting for the bias and variance of the minibatch estimators when sampling subgraphs for training. It also explores different subgraph sampling schemes, such as a random walk-based sampler, which is able to co-sample nodes having high influence on each other and guarantees each edge has a non-negligible probability of being sampled. At the time of writing, GraphSAINT is the state-of-the-art method for large graphs.

Figure 1: The SIGN architecture with the inception layer highlighted. $\boldsymbol{\Theta}_k$ represents the $k$-th dense layer required for producing filters of radius $k$, $|$ is the concatenation operation, and $\boldsymbol{\Omega}$ the dense layer used to compute the final predictions.
Figure 2: Comparison of different sampling strategies. In node-wise and layer-wise sampling, only a fraction of the nodes in the batch are optimization nodes (in orange) where the loss is computed; graph-wise sampling operates on entire subgraphs of optimization nodes, making more efficient use of each batch.

3 Scalable Inception Graph Neural Networks

In this work we propose SIGN, an alternative method to scale graph neural networks to extremely large graphs. SIGN is not based on sampling nodes or subgraphs as these operations introduce bias into the optimization procedure.

We take inspiration from two recent findings: (i) despite its simplicity, the S-GCN (3) model is extremely efficient and attains results similar to models with multiple stacked convolutional layers Wu et al. (2019); (ii) the GCN aggregation scheme (2) has been shown to essentially act as a low-pass filter NT and Maehara (2019), while still performing on par with models using more complex aggregation functions in the task of semi-supervised node classification Shchur et al. (2018b).

Accordingly, we propose the following architecture for node-wise classification:

$$\mathbf{Z} = \sigma\big(\big[\mathbf{X}\boldsymbol{\Theta}_0,\; \mathbf{A}\mathbf{X}\boldsymbol{\Theta}_1,\; \dots,\; \mathbf{A}^r\mathbf{X}\boldsymbol{\Theta}_r\big]\big), \qquad \mathbf{Y} = \xi(\mathbf{Z}\boldsymbol{\Omega}), \tag{4}$$

where $\mathbf{A}$ is a fixed $n \times n$ graph diffusion matrix, $\boldsymbol{\Theta}_0, \dots, \boldsymbol{\Theta}_r$ and $\boldsymbol{\Omega}$ are learnable matrices, respectively of dimensions $d \times d'$ and $d'(r+1) \times c$ for $c$ classes, $[\cdot, \dots, \cdot]$ is the concatenation operation, and $\sigma$, $\xi$ are non-linearities, the second one computing class probabilities, e.g. via a softmax or sigmoid function, depending on the task at hand. Note that the model in equation (4) is analogous to the popular inception module Szegedy et al. (2015) for classic CNN architectures (Figure 1): it consists of convolutional filters of different sizes determined by the parameter $r$, where $r = 0$ corresponds to the $1 \times 1$ convolutions in the inception module (amounting to linear transformations of the features in each node, without diffusion across nodes). Owing to this analogy, we refer to our model as the Scalable Inception Graph Network (SIGN). We note that one work extending the idea of an inception module to GNNs is Kazi et al. (2019); in that work, however, the authors do not discuss the inclusion of a linear, non-diffusive term ($r = 0$), which effectively acts as a skip connection. Additionally, their focus is not on scaling the model to large graphs, but rather on capturing intra- and inter-graph structural heterogeneity.
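A minimal PyTorch sketch of equation (4) is given below (our illustrative reconstruction, not the authors' released implementation): since the diffused features $\mathbf{A}^k\mathbf{X}$ are precomputed, the model itself reduces to parallel dense layers, a concatenation, and a classification head.

```python
import torch
import torch.nn as nn

class SIGN(nn.Module):
    """Sketch of the SIGN inception layer of equation (4)."""
    def __init__(self, in_dim, hidden_dim, num_classes, r):
        super().__init__()
        # One dense transform Theta_k per diffusion power k = 0, ..., r.
        self.thetas = nn.ModuleList(
            [nn.Linear(in_dim, hidden_dim) for _ in range(r + 1)]
        )
        self.act = nn.PReLU()
        # Omega: classification head on the concatenated representations.
        self.omega = nn.Linear(hidden_dim * (r + 1), num_classes)

    def forward(self, xs):
        # xs is a list [X, AX, A^2 X, ..., A^r X] of precomputed features.
        hs = [self.act(theta(x)) for theta, x in zip(self.thetas, xs)]
        z = torch.cat(hs, dim=1)          # inception-style concatenation
        return self.omega(z)              # logits; softmax/sigmoid applied in the loss

# Toy usage: 1000 nodes, 64 input features, r = 3.
r, n, d = 3, 1000, 64
xs = [torch.randn(n, d) for _ in range(r + 1)]   # stand-ins for precomputed A^k X
model = SIGN(in_dim=d, hidden_dim=128, num_classes=40, r=r)
logits = model(xs)                                # shape (1000, 40)
```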

Generalization of other models.    It is also easy to observe that various graph convolutional layers can be obtained as particular settings of (4). In particular, by setting the non-linearity $\sigma$ to PReLU, that is

$$\sigma(x) = \max(x, 0) + a \min(x, 0),$$

where $a$ is a learnable parameter, ChebNet, GCN, and S-GCN can be automatically learnt if suitable diffusion operators and activations are used (see Table 1).

ChebNet Defferrard et al. (2016)
GCN Kipf and Welling (2017)
S-GCN Wu et al. (2019) 1
Table 1: SIGN as a generalization of ChebNet, GCN and S-GCN. By appropriate configuration, our inception layer is able to replicate popular GNN layers.

Efficient computation.    Finally, and most importantly, we observe that the matrix products $\mathbf{A}\mathbf{X}, \dots, \mathbf{A}^r\mathbf{X}$ in equation (4) do not depend on the learnable model parameters and can therefore be precomputed. For large graphs, distributed computing infrastructures such as Apache Spark can speed up this computation. This effectively reduces the computational complexity of the overall model to that of a multi-layer perceptron (i.e. $\mathcal{O}(Lnd^2)$, where $d$ is the number of features, $n$ the number of nodes in the training/testing graph, and $L$ the overall number of feed-forward layers in the model).
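A minimal single-machine sketch of this precomputation with SciPy sparse matrices (our illustration, assuming a symmetrically normalized adjacency with self-loops as the diffusion operator) is:

```python
import numpy as np
import scipy.sparse as sp

def precompute_sign_features(W, X, r):
    """Return [X, AX, A^2 X, ..., A^r X] for a normalized diffusion operator A.

    W: scipy.sparse adjacency matrix (n x n)
    X: dense node feature matrix (n x d)
    r: largest filter radius
    """
    n = W.shape[0]
    W_tilde = W + sp.eye(n)                       # add self-loops
    deg = np.asarray(W_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    A = D_inv_sqrt @ W_tilde @ D_inv_sqrt         # one possible normalization choice

    feats = [X]
    for _ in range(r):
        feats.append(A @ feats[-1])               # A^k X from A^{k-1} X
    return feats

# Toy usage: random sparse graph with 10k nodes and 128 features.
W = sp.random(10_000, 10_000, density=1e-3, format="csr")
W = W + W.T                                       # symmetrize
X = np.random.randn(10_000, 128)
xs = precompute_sign_features(W, X, r=3)          # feed these to the model sketch above
```

The returned list can be fed directly to the SIGN module sketched earlier; for very large graphs the same products would be computed once in a distributed system and stored alongside the node features.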

Table 2 compares the complexity of our SIGN model to other scalable architectures GraphSAGE, ClusterGCN, and GraphSAINT.

Preprocessing Forward Pass
GraphSAGE Hamilton et al. (2017b)
ClusterGCN Chiang et al. (2019)
GraphSAINT Zeng et al. (2019)
SIGN (Ours)
Table 2: Time complexity, where $L$ is the number of layers, $r$ is the filter size, $n$ the number of nodes (in training or inference, respectively), $m$ the number of edges, and $d$ the feature dimensionality (assumed fixed for all layers). For GraphSAGE, $k$ is the number of sampled neighbors per node. The forward pass complexity corresponds to an entire epoch where all nodes are seen.

4 Experiments

4.1 Datasets

Our method is applicable to both transductive and inductive learning. In the inductive setting, test and validation nodes are held out at training time, i.e. training nodes are only connected to other training nodes. During model evaluation, on the contrary, all the original edges are considered. In the transductive (semi-supervised) setting, all the nodes are seen at training time, even though only a small fraction of the nodes have training labels.

Inductive experiments are performed using four large public datasets: Reddit, Flickr, Yelp and PPI. Introduced in Hamilton et al. (2017b), Reddit is the gold-standard benchmark for GNNs on large-scale graphs. Flickr and Yelp were introduced in Zeng et al. (2019), and PPI in Zitnik and Leskovec (2017). In agreement with Zeng et al. (2019), we confirm that the performance of a variety of models on the last three datasets is unstable, meaning that large variations in the results are observed for very small changes in architecture and optimization parameters. We hypothesize that this is due to errors in the data, or to unfortunate a priori choices of the training, validation, and test splits. We still report results on these datasets for the purpose of comparison with Zeng et al. (2019). Amongst the considered inductive datasets, Reddit and Flickr are multiclass node-wise classification problems: in the former, the task is to predict communities of online posts based on user comments; in the latter, the task is image categorization based on the description and common properties of online images. Yelp and PPI are multilabel classification problems: the objective of the former is to predict business attributes based on customer reviews, while the latter task consists of predicting protein functions from the interactions of human tissue proteins.

While our focus is on large graphs, we also experiment with smaller but well-established transductive datasets to compare SIGN to traditional GNN methods: Amazon-Computers Shchur et al. (2018a), Amazon-Photos Shchur et al. (2018a), and Coauthor-CS Shchur et al. (2018a). These datasets are used in the most recent state-of-the-art evaluation presented in Klicpera et al. (2019). Following Klicpera et al. (2019), we use 20 training nodes per class; 1500 validation nodes are used for Amazon-Photos and Amazon-Computers, and 5000 for Coauthor-CS. Dataset statistics are reported in Tables 4 and 5.

4.2 Setup

For all datasets we use an inception convolutional layer with PReLU activation He et al. (2015) and a fixed diffusion operator $\mathbf{A}$ (the filter size $r$ is chosen as discussed in Section 4.3). To allow for larger model capacity in the inception module and in computing final model predictions, we replace the single-layer projections performed by parameters $\boldsymbol{\Theta}_k$ and $\boldsymbol{\Omega}$ with multiple feedforward layers. Model outputs for multiclass classification problems are normalized via softmax; for the multilabel classification tasks we use element-wise sigmoid functions. Model parameters are found by minimizing the cross-entropy loss via minibatch gradient descent with the Adam optimizer Kingma and Ba (2014) and early stopping, i.e. we stop training if the validation loss does not decrease for a fixed number of consecutive evaluation phases. In order to limit overfitting, we apply the standard regularization techniques of weight decay and dropout Srivastava et al. (2014). Additionally, batch normalization Ioffe and Szegedy (2015) is used in every layer to stabilize training and increase convergence speed. Model hyperparameters (weight decay coefficient, dropout rate, hidden layer sizes, batch size, learning rate, number of feedforward layers in the inception module, number of feedforward layers for the final classification) are optimized on the validation sets using Bayesian optimization with a tree Parzen estimator surrogate function Bergstra et al. (2011); a sketch of such a search is given after Table 3. Table 3 shows the hyperparameter ranges defining the search space.

Learning Rate Batch Size Dropout Weight Decay Hidden Dimensions Inception FF Classification FF
Range [0.00001, 0.01] [32, 2048] [0, 1] [0, 0.01] [16, 1000] [1, 2] [1, 2]
Table 3: Ranges explored during the hyperparameter optimization in the form of [low, high]. Inception FF and Classification FF are the number of feedforward layers in the representation part of the model (replacing ) and the classification part of the model (replacing ) respectively.
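As a sketch of how such a search can be run (using the hyperopt library's tree Parzen estimator; the objective shown is a dummy stand-in for a full SIGN training run), the ranges of Table 3 can be encoded as follows:

```python
import numpy as np
from hyperopt import fmin, hp, tpe, Trials

# Search space mirroring the ranges in Table 3 (illustrative encoding).
space = {
    "lr":            hp.loguniform("lr", np.log(1e-5), np.log(1e-2)),
    "batch_size":    hp.quniform("batch_size", 32, 2048, 1),
    "dropout":       hp.uniform("dropout", 0.0, 1.0),
    "weight_decay":  hp.uniform("weight_decay", 0.0, 0.01),
    "hidden_dim":    hp.quniform("hidden_dim", 16, 1000, 1),
    "inception_ff":  hp.choice("inception_ff", [1, 2]),
    "classifier_ff": hp.choice("classifier_ff", [1, 2]),
}

def objective(params):
    # Hypothetical: train SIGN with `params` and return the validation loss.
    # Replaced here by a dummy value so the sketch runs standalone.
    return params["dropout"] + params["lr"]

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```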
Dataset Nodes Edges Avg. Degree Features Classes Train / Val / Test
Reddit 232,965 11,606,919 50 602 41(s) 0.66 / 0.10 / 0.24
Yelp 716,847 6,977,410 10 300 100(m) 0.75 / 0.10 / 0.15
Flickr 89,250 899,756 10 500 7(s) 0.50 / 0.25 / 0.25
PPI 14,755 225,270 15 50 121(m) 0.66 / 0.12 / 0.22
Table 4: Summary of inductive datasets statistics with (s)ingle and (m)ulticlass.
Dataset Nodes Edges Avg. Degree Features Classes Label rate
Computers 13,381 245,778 35.76 767 10(s) 0.015
Photos 7,487 119,043 31.13 745 8(s) 0.021
Coauthor-CS 18,333 81,894 8.93 6805 15(s) 0.016
Table 5: Summary of transductive datasets statistics.
Method Preprocessing Inference
ClusterGCN 415.29 ± 5.83 9.23 ± 0.10
GraphSAINT 34.29 ± 0.06 3.47 ± 0.03
SIGN (Ours) 234.27 ± 3.79 0.17 ± 0.00
Table 6: Mean and standard deviation of preprocessing and inference time (in seconds) on Reddit, computed over 10 runs.

Baselines.     On the large scale inductive datasets, we compare our method to GCN Kipf and Welling (2017), FastGCN Chen et al. (2018), Stochastic-GCN Chen and Zhu (2018), AS-GCN Huang et al. (2018), GraphSAGE Hamilton et al. (2017b), ClusterGCN Chiang et al. (2019), and GraphSAINT Zeng et al. (2019), which constitute the current state-of-the-art. To make the comparison fair, all models have 2 graph convolutional layers. The results for the baselines are reported from Zeng et al. (2019). On the smaller transductive datasets, we compare to the well established methods GCN Kipf and Welling (2017), GAT Velickovic et al. (2018), JK Xu et al. (2018), GIN Xu et al. (2019), ARMA Bianchi et al. (2019b), and the current state-of-the-art DIGL Klicpera et al. (2019).

Implementation and machine specifications.    All experiments, including timings, are run on an AWS p2.8xlarge instance with 8 NVIDIA K80 GPUs, 32 vCPUs, an Intel(R) Xeon(R) E5-2686 v4 CPU @ 2.30GHz, and 488 GiB of RAM. SIGN is implemented in PyTorch Paszke et al. (2019).

4.3 Results

Inductive.     Table 8 presents the results on the inductive datasets. In line with Zeng et al. (2019), we report the micro-averaged F1 score means and standard deviations computed over 10 runs. Here we match the state-of-the-art accuracy on Reddit, while consistently performing competitively on other datasets.

Runtime.     While being comparable in terms of accuracy, our method has the advantage of being significantly faster than other methods on large graphs, both in training and inference. In Figure 3, we plot the validation F1 score on Reddit as a function of runtime from the start of training. SIGN converges faster than the other methods, while also converging to a much better F1 score than ClusterGCN. We do not report runtime results for GraphSAGE as it is substantially slower than the other methods Chiang et al. (2019).

In Table 6, we report the preprocessing time needed to extract the data and prepare it for training, as well as the inference time on the entire test set for Reddit. For this experiment, we used the authors' published code. While slower than GraphSAINT in the preprocessing phase, SIGN takes a fraction of the time for inference, outperforming GraphSAINT and GraphSAGE by over two orders of magnitude. It is important to note that while our implementation is in PyTorch, the implementations of GraphSAINT (https://github.com/GraphSAINT/GraphSAINT) and ClusterGCN (https://github.com/google-research/google-research/tree/master/cluster_gcn) are in TensorFlow, which, according to Chiang et al. (2019), is up to six times faster than PyTorch. Moreover, while GraphSAINT's preprocessing is parallelized, ours is not. Aiming at a further performance speedup, a TensorFlow implementation of our model, together with parallelization of the preprocessing routines, is left as future work.

Transductive.    To further validate our method, we compare it to classic as well as state-of-the-art (non-scalable) GNN methods on well-established small-scale benchmarks.

Table 7 presents the results on these transductive datasets, averaged over 100 different train/val/test splits. While our focus is on node-wise classification on large-scale graphs, SIGN is competitive also on smaller, well-established transductive benchmarks, outperforming classical methods and getting close to the current state-of-the-art method (DIGL). This suggests that, while being scalable and fast, and therefore well-suited to large-scale applications, it can also be effective on problems involving graphs of modest size.

Figure 3: Convergence in time for various methods on Reddit dataset. Our method attains much faster training convergence. We don’t report timing results for GraphSAGE as it is substantially slower than other methods Chiang et al. (2019).
Figure 4: Convergence test F1 scores on Reddit as a function of the convolution depth (power parameter r).

Effect of convolution size $r$.     We perform a sensitivity analysis on the power parameter $r$, which defines the size of the largest convolutional filter in the inception layer. On Reddit, $r = 2$ works best, and we keep this configuration on all datasets. Figure 4 depicts the test F1 scores at convergence as a function of $r$. It is interesting to see that, while the model with $r = 3$ is a generalization of the model with $r = 2$, increasing $r$ is actually detrimental in this case. We hypothesize this is due to the features aggregated from the 3-hop neighborhood of a node not being informative, but actually misleading for the model.

Method Amazon-Computers Amazon-Photos Coauthor-CS
GCN Kipf and Welling (2017) 84.75 ± 0.23 92.08 ± 0.20 91.83 ± 0.08
GAT Velickovic et al. (2018) 45.37 ± 4.20 53.40 ± 5.49 90.89 ± 0.13
JK Xu et al. (2018) 83.33 ± 0.27 91.07 ± 0.26 91.11 ± 0.09
GIN Xu et al. (2019) 55.44 ± 0.83 68.34 ± 1.16 -
ARMA Bianchi et al. (2019b) 84.36 ± 0.26 91.41 ± 0.22 91.32 ± 0.08
DIGL Klicpera et al. (2019) 86.67 ± 0.21 92.93 ± 0.21 92.97 ± 0.07
SIGN (Ours) 85.93 ± 1.21 91.72 ± 1.20 91.98 ± 0.50
Table 7: Micro-averaged F1 score average and standard deviation over 100 train/val/test splits for different models, and a different model initialization for each split. The top three performance scores are highlighted as: First, Second, Third.
Method Reddit Flickr PPI Yelp
GCN Kipf and Welling (2017) 0.933 ± 0.000 0.492 ± 0.003 0.515 ± 0.006 0.378 ± 0.001
FastGCN Chen et al. (2018) 0.924 ± 0.001 0.504 ± 0.001 0.513 ± 0.032 0.265 ± 0.053
Stochastic-GCN Chen and Zhu (2018) 0.964 ± 0.001 0.482 ± 0.003 0.963 ± 0.010 0.640 ± 0.002
AS-GCN Huang et al. (2018) 0.958 ± 0.001 0.504 ± 0.002 0.687 ± 0.012 -
GraphSAGE Hamilton et al. (2017b) 0.953 ± 0.001 0.501 ± 0.013 0.637 ± 0.006 0.634 ± 0.006
ClusterGCN Chiang et al. (2019) 0.954 ± 0.001 0.481 ± 0.005 0.875 ± 0.004 0.609 ± 0.005
GraphSAINT Zeng et al. (2019) 0.966 ± 0.001 0.511 ± 0.001 0.981 ± 0.004 0.653 ± 0.003
SIGN (Ours) 0.966 ± 0.003 0.503 ± 0.003 0.965 ± 0.002 0.623 ± 0.005
Table 8: Micro-averaged F1 score average and standard deviation over 10 runs with the same train/val/test split but different random model initialization. The top three performance scores are highlighted as: First, Second, Third.

5 Conclusion and Future Work

Our results are consistent with previous reports Shchur et al. (2018b) advocating in favor of simple architectures (with just a single graph convolutional layer) for graph learning tasks. Our architecture achieves a good trade-off between simplicity, allowing efficient and scalable application to very large graphs, and expressiveness, attaining competitive performance in a variety of applications: it matches the accuracy of state-of-the-art methods on common graph learning benchmarks while being significantly faster in training and up to two orders of magnitude faster in inference than other scalable approaches. For these reasons, SIGN is well suited to industrial large-scale systems.

Extensions.     Though in this paper we applied our model to the supervised node-wise classification setting, it is generic and can also be used for graph-wise classification tasks and unsupervised representation learning (e.g. graph autoencoders). The latter is a particularly important setting in recommender systems Berg et al. (2018).

While we focused our discussion on undirected graphs for the sake of simplicity, our model is straightforwardly applicable to directed graphs, in which case $\mathbf{A}$ is a non-symmetric diffusion operator. Furthermore, more complex aggregation operations, e.g. higher-order Laplacians Barbarossa and Sardellitti (2019) or directed diffusion based on graph motifs Monti et al. (2018), can be straightforwardly incorporated as additional channels in the inception module.

Finally, while our method relies on linear graph aggregation operations of the form $\mathbf{A}^k \mathbf{X}$ for efficient precomputation, it is possible to make the diffusion operator dependent on the node features (and edge features, if available) as a matrix of the form $\mathbf{A}(\mathbf{X})$.

Limitations.     Graph attention Velickovic et al. (2018) and similar mechanisms Monti et al. (2017) require a more elaborate parametric aggregation operator of the form $\mathbf{A}(\mathbf{X}; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are learnable parameters. This precludes efficient precomputation, which is key to the efficiency of our approach. Attention can be implemented in our scheme by training on a small subset of the graph to first determine the attention parameters, then fixing them to precompute the diffusion operator used during training and inference. For the same reason, only a single graph convolutional (diffusion) layer can be precomputed efficiently, though the architecture supports multiple linear layers. Architectures with many convolutional layers are achievable by layer-wise training.

References

  • S. Barbarossa and S. Sardellitti (2019) Topological signal processing over simplicial complexes. arXiv:1907.11577. Cited by: §5.
  • P. W. Battaglia et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261. Cited by: §1.
  • R. v. d. Berg, T. N. Kipf, and M. Welling (2018) Graph convolutional matrix completion. In KDD, Cited by: §5.
  • J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2546–2554. External Links: Link Cited by: §4.2.
  • F. M. Bianchi, D. Grattarola, and C. Alippi (2019a) Mincut pooling in graph neural networks. arXiv:1907.00481. Cited by: §2.1.
  • F. M. Bianchi, D. Grattarola, L. Livi, and C. Alippi (2019b) Graph neural networks with convolutional ARMA filters. CoRR abs/1901.01343. External Links: Link, 1901.01343 Cited by: §4.2, Table 7.
  • M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Proc. Magazine 34 (4), pp. 18–42. External Links: ISSN 1053-5888 Cited by: §1.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014) Spectral networks and locally connected networks on graphs. In ICLR, Cited by: §2.3, §2.3.
  • J. Chen and J. Zhu (2018) Stochastic training of graph convolutional networks. Cited by: §1, §2.4, §4.2, Table 8.
  • J. Chen, T. Ma, and C. Xiao (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. Cited by: §1, §2.4, §4.2, Table 8.
  • W. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C. Hsieh (2019) Cluster-gcn: an efficient algorithm for training deep and large graph convolutional networks. Cited by: §1, §2.4, Table 2, Figure 4, §4.2, §4.3, §4.3, Table 8.
  • N. Choma, F. Monti, L. Gerhardt, T. Palczewski, Z. Ronaghi, P. Prabhat, W. Bhimji, M. Bronstein, S. Klein, and J. Bruna (2018) Graph neural networks for icecube signal classification. In ICMLA, Cited by: §1.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, Cited by: §1, §2.3, §2.3, Table 1.
  • I. S. Dhillon, Y. Guan, and B. Kulis (2007) Weighted graph cuts without eigenvectors a multilevel approach. PAMI 29 (11), pp. 1944–1957. Cited by: §2.1.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS, Cited by: §1.
  • P. Gainza et al. (2019) Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods 17, pp. 184–192. Cited by: §1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §1.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In KDD, Cited by: §2.1.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017a) Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin. Cited by: §1.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017b) Inductive representation learning on large graphs. Cited by: §1, §1, §2.4, Table 2, §4.1, §4.2, Table 8.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. External Links: 1502.01852 Cited by: §4.2.
  • M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv:1506.05163. Cited by: §2.3.
  • W. Huang, T. Zhang, Y. Rong, and J. Huang (2018) Adaptive sampling towards fast graph representation learning. Cited by: §1, §2.4, §4.2, Table 8.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 448–456. External Links: Link Cited by: §4.2.
  • A. Kazi, S. Shekarforoush, S. Arvind Krishna, H. Burwinkel, G. Vivar, K. Kortüm, S. Ahmadi, S. Albarqouni, and N. Navab (2019) InceptionGCN: receptive field aware graph convolutional network for disease prediction. In Information Processing in Medical Imaging, Cham, pp. 73–85. External Links: ISBN 978-3-030-20351-1 Cited by: §1, §3.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §4.2.
  • T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. Cited by: §2.1.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §1, §2.3, Table 1, §4.2, Table 7, Table 8.
  • J. Klicpera, S. Weißenberger, and S. Günnemann (2019) Diffusion improves graph learning. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §4.1, §4.2, Table 7.
  • Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (4), pp. 541–551. Cited by: §2.3.
  • R. Levie, M. M. Bronstein, and G. Kutyniok (2019) Transferability of spectral graph convolutional neural networks. arXiv:1907.12972. Cited by: §2.3.
  • R. Levie, F. Monti, X. Bresson, and M. M. Bronstein (2018) Cayleynets: graph convolutional neural networks with complex rational spectral filters. Trans. Signal Proc. 67 (1), pp. 97–109. Cited by: §2.3.
  • F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein (2016) Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, Cited by: §1, §1.
  • F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, Cited by: §5.
  • F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein (2019) Fake news detection on social media using geometric deep learning. CoRR abs/1902.06673. External Links: Link, 1902.06673 Cited by: §1.
  • F. Monti, K. Otness, and M. M. Bronstein (2018) Motifnet: a motif-based graph convolutional network for directed graphs. In DSW, Cited by: §5.
  • H. NT and T. Maehara (2019) Revisiting graph neural networks: all we have is low-pass filters. arXiv:1905.09550. Cited by: §3.
  • S. Parisot, S. I. Ktena, E. Ferrante, M. Lee, R. Guerrero, B. Glocker, and D. Rueckert (2018) Disease prediction using graph convolutional networks: application to autism spectrum disorder and alzheimer’s disease. Med Image Anal 48, pp. 117–130. Cited by: §1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alche-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §4.2.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In KDD, Cited by: §2.1.
  • S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV, pp. 401–417. Cited by: §1.
  • E. Rossi, F. Monti, M. Bronstein, and P. Liò (2019) NcRNA classification with graph convolutional networks. In KDD Workshop on Deep Learning on Graphs, Cited by: §1.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. Trans. Neural Networks 20 (1), pp. 61–80. Cited by: §1.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–93. Cited by: §1.
  • O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018a) Pitfalls of graph neural network evaluation. CoRR abs/1811.05868. External Links: Link, 1811.05868 Cited by: §4.1.
  • O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018b) Pitfalls of graph neural network evaluation. Relational Representation Learning Workshop. Cited by: §1, §3, §5.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. External Links: ISSN 1532-4435 Cited by: §4.2.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, Cited by: §1, §3.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §1, §4.2, Table 7.
  • K. Veselkov et al. (2019) HyperFoods: machine intelligent mapping of cancer-beating molecules in foods. Scientific Reports 9 (1), pp. 1–12. Cited by: §1, §5.
  • F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. Cited by: §1, §1, §2.3, Table 1, §3.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. External Links: Link Cited by: §4.2, Table 7.
  • K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. Cited by: §4.2, Table 7.
  • R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018a) Graph convolutional neural networks for web-scale recommender systems. Cited by: §1, §2.4.
  • Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018b) Hierarchical graph representation learning with differentiable pooling. In NeurIPS, Cited by: §2.1.
  • H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. K. Prasanna (2019) GraphSAINT: graph sampling based inductive learning method. arXiv:1907.04931. Cited by: §2.4, Table 2, §4.1, §4.2, §4.3, Table 8.
  • M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In NIPS, Cited by: §1.
  • Z. Zhang, P. Cui, and W. Zhu (2018) Deep learning on graphs: a survey. arXiv:1812.04202. Cited by: §1.
  • M. Zitnik, M. Agrawal, and J. Leskovec (2018) Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34 (13), pp. i457–i466. Cited by: §1.
  • M. Zitnik and J. Leskovec (2017) Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33 (14), pp. i190–i198. External Links: ISSN 1460-2059, Link, Document Cited by: §4.1.