SIGN: Scalable Inception Graph Network
Geometric deep learning, a novel class of machine learning algorithms extending classical deep learning architectures to non-Euclidean structured data such as manifolds and graphs, has recently been applied to a broad spectrum of problems ranging from computer graphics and chemistry to high energy physics and social media. In this paper, we propose SIGN, a scalable graph neural network analogous to the popular inception module used in classical convolutional architectures. We show that our architecture is able to effectively deal with large-scale graphs via pre-computed multi-scale neighborhood features. Extensive experimental evaluation on various open benchmarks shows the competitive performance of our approach compared to a variety of popular architectures, while requiring a fraction of training and inference time.
Deep learning on graphs, also known as geometric deep learning (GDL) or graph representation learning (GRL), has emerged in a matter of just a few years from a niche topic to one of the most prominent fields in machine learning. Graph convolutional neural networks (GCNs), which can be traced back to the seminal work of Scarselli et al. (2008), seek to generalize classical convolutional architectures (CNNs) to graph-structured data. A wide variety of convolution-like operations have been developed on graphs, including ChebNet Defferrard et al. (2016), MoNet Monti et al. (2016), GCN Kipf and Welling (2017), S-GCN Wu et al. (2019), GAT Velickovic et al. (2018), and GraphSAGE Hamilton et al. (2017b). We refer the reader to recent review papers Bronstein et al. (2017); Hamilton et al. (2017a); Battaglia and others (2018); Zhang et al. (2018) for a comprehensive overview of deep learning on graphs and its mathematical underpinnings.
Graph deep learning models have been extremely successful in modeling relational data in a variety of different domains, including social network link prediction Zhang and Chen (2018), human-object interaction Qi et al. (2018), computer graphics Monti et al. (2016), particle physics Choma et al. (2018), chemistry Duvenaud et al. (2015); Gilmer et al. (2017), medicine Parisot et al. (2018), drug repositioning Zitnik et al. (2018), discovery of anti-cancer foods Veselkov and others (2019), modeling of proteins Gainza and others (2019) and nucleic acids Rossi et al. (2019), and fake news detection on social media Monti et al. (2019), to mention a few. Somewhat surprisingly, very simple architectures often perform well in many applications Shchur et al. (2018b). In particular, graph convolutional networks (GCN) Kipf and Welling (2017) and their more recent variant S-GCN Wu et al. (2019) apply a shared node-wise linear transformation of the node features, followed by one or more iterations of diffusion on the graph.
Until recently, most of the research in the field has focused on small-scale datasets (CORA Sen et al. (2008), with only a few thousand nodes, still being among the most widely used), and relatively little effort has been devoted to scaling these methods to web-scale graphs such as the Facebook or Twitter social networks. Scaling is indeed a major challenge precluding the wide application of graph deep learning methods in large-scale industrial settings: compared to Euclidean neural networks, where the training loss can be decomposed into individual samples and computed independently, graph convolutional networks diffuse information between nodes along the edges of the graph, making the loss computation interdependent for different nodes. Furthermore, in typical graphs the number of nodes grows exponentially with the increase of the filter receptive field, incurring significant computational and memory complexity.
Graph sampling approaches Hamilton et al. (2017b); Ying et al. (2018a); Chen et al. (2018); Huang et al. (2018); Chen and Zhu (2018) attempt to alleviate the cost of training graph neural networks by selecting a small number of neighbors. GraphSAGE Hamilton et al. (2017b) uniformly randomly samples the neighborhood of a given node. PinSAGE Ying et al. (2018a) uses random walks to improve the quality of such approximation. ClusterGCN Chiang et al. (2019) clusters the graph and enforces diffusion of information only within the computed clusters. GraphSAINT Zeng et al. (2019) uses unbiased estimators of neighborhood graphs: its authors propose multiple methods to sample minibatch subgraphs during training while using a normalization technique to eliminate bias.
In this paper, we propose a simple scalable graph neural network architecture generalizing GCN, S-GCN, ChebNet and related methods. Our architecture is analogous to the inception module Szegedy et al. (2015); Kazi et al. (2019) and combines graph convolutional filters of different size that are amenable to efficient precomputation, allowing extremely fast training and inference. Furthermore, our architecture is compatible with various sampling approaches.
We provide extensive experimental validation showing that, despite its simplicity, our approach produces comparable results to state-of-the-art architectures on a variety of large-scale graph datasets while being significantly faster (orders of magnitude) in training and inference.
Broadly speaking, the goal of graph representation learning is to construct a set of features (‘embeddings’) representing the structure of the graph and the data thereon. We can distinguish among Node-wise embeddings, representing each node of the graph, Edge-wise embeddings, representing each edge in the graph, and Graph-wise embeddings, representing the graph as a whole. In the context of node-wise prediction problems (e.g. node-wise classification tasks), we can make the distinction between the following different settings or problems. Transductive learning assumes that the entire graph is known, and thus the same graph is used during training and testing (albeit different nodes are used for training and testing). In the Inductive setting, training and testing are performed on different graphs. Supervised learning uses a training set of labeled nodes (or graphs, respectively) and tries to predict these labels on a test set. The goal of Unsupervised learning is to compute a representation of the nodes (or the graph, respectively) capturing the underlying structure. Typical representatives of this class of architectures are graph autoencoders Kipf and Welling (2016) and random walk-based embeddings Grover and Leskovec (2016); Perozzi et al. (2014).
A typical graph neural network architecture consists of graph Convolution-like operators (discussed in detail in Section 2.3) performing local aggregation of features by means of message passing with the neighbor nodes, and possibly Pooling, amounting to fixed Dhillon et al. (2007) or learnable Ying et al. (2018b); Bianchi et al. (2019a) graph coarsening. Additionally, graph Sampling schemes (detailed in Section 2.4) can be employed on large-scale graphs to reduce the computational complexity.
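Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with $\mathcal{V} = \{1, \ldots, n\}$, be an undirected weighted graph, represented by the symmetric $n \times n$ adjacency matrix $\mathbf{W}$, where $w_{ij} > 0$ if $(i,j) \in \mathcal{E}$ and zero otherwise. The diagonal degree matrix $\mathbf{D} = \mathrm{diag}\big(\sum_{j} w_{1j}, \ldots, \sum_{j} w_{nj}\big)$ represents the number of neighbors of each node. We further assume that each node is endowed with a $d$-dimensional feature vector and arrange all the node features as rows of the $n \times d$-dimensional matrix $\mathbf{X}$.
The normalized graph Laplacian is an $n \times n$ positive semi-definite matrix $\boldsymbol{\Delta} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$. Given the $n \times d$-dimensional node feature matrix $\mathbf{X}$, the Laplacian amounts to computing the difference of the feature at each node with the local weighted average:
$$(\boldsymbol{\Delta}\mathbf{X})_i = \mathbf{x}_i - \sum_{j \in \mathcal{N}_i} \frac{w_{ij}}{\sqrt{D_{ii} D_{jj}}}\, \mathbf{x}_j,$$
where $\mathbf{x}_i$ is the $i$-th row of $\mathbf{X}$ and $\mathcal{N}_i$ is the neighborhood of node $i$.
The Laplacian admits an eigendecomposition of the form $\boldsymbol{\Delta} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^\top$ with orthogonal eigenvectors $\boldsymbol{\Phi} = (\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n)$ and non-negative eigenvalues $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ arranged in increasing order $0 = \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$. The eigenvectors play the role of a Fourier basis on the graph and the corresponding eigenvalues can be interpreted as frequencies. The graph Fourier transform is given by $\hat{\mathbf{x}} = \boldsymbol{\Phi}^\top \mathbf{x}$, and one can define the spectral analogy of a convolution on the graph as
$$\mathbf{x} \star \mathbf{y} = \boldsymbol{\Phi}\big((\boldsymbol{\Phi}^\top \mathbf{x}) \odot (\boldsymbol{\Phi}^\top \mathbf{y})\big) = \boldsymbol{\Phi}\,\mathrm{diag}(\hat{y}_1, \ldots, \hat{y}_n)\,\hat{\mathbf{x}},$$
where $\odot$ denotes the element-wise matrix product.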
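The definitions above can be made concrete with a minimal NumPy sketch. The 4-node path graph, the input signal, and the exponential low-pass filter `g_hat` are illustrative assumptions, not taken from the paper; only the normalized Laplacian, its eigendecomposition, and the graph Fourier transform follow the text.

```python
import numpy as np

def normalized_laplacian(W):
    """Return I - D^{-1/2} W D^{-1/2} for a symmetric weighted adjacency W."""
    deg = W.sum(axis=1)
    # Isolated nodes (degree 0) get a zero scaling factor instead of 1/0.
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0)), 0.0)
    return np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]

# Toy path graph 0-1-2-3 (hypothetical example data).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Delta = normalized_laplacian(W)

# Eigendecomposition Delta = Phi diag(lam) Phi^T; eigh sorts eigenvalues ascending.
lam, Phi = np.linalg.eigh(Delta)

x = np.array([1.0, 0.0, 0.0, 0.0])  # a signal living on the nodes
x_hat = Phi.T @ x                    # graph Fourier transform
g_hat = np.exp(-2.0 * lam)           # example spectral filter (an assumption)
y = Phi @ (g_hat * x_hat)            # filter in the spectral domain, then invert
```

Note that this route needs the full eigendecomposition, which is exactly the cost that the polynomial filters discussed next avoid.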
Spectral graph CNNs. Bruna et al. (2014) used the graph Fourier transform to generalize convolutional neural networks (CNN) LeCun et al. (1989) to graphs. This approach has multiple drawbacks. First, the computation of the Fourier transform has $\mathcal{O}(n^2)$ complexity, in addition to the $\mathcal{O}(n^3)$ precomputation of the eigenvectors $\boldsymbol{\Phi}$. Second, the number of filter parameters is $\mathcal{O}(n)$. Third, the filters are not guaranteed to be localized in the node domain. Fourth, the construction explicitly assumes the underlying graph to be undirected, in order for the Laplacian to be a symmetric matrix with orthogonal eigenvectors. Finally and most importantly, filters learned on one graph do not generalize to another.
ChebNet. A way to address these issues is to model the filter as a transfer function $\hat{g}(\lambda)$, applied to the Laplacian as $\hat{g}(\boldsymbol{\Delta}) = \boldsymbol{\Phi}\,\hat{g}(\boldsymbol{\Lambda})\,\boldsymbol{\Phi}^\top$. Unlike the construction of Bruna et al. (2014), which does not generalize across graphs, the filter computed in the above manner is stable under graph perturbations Levie et al. (2019). If $\hat{g}$ is a smooth function, the resulting filters are localized in the node domain Henaff et al. (2015). In the case when $\hat{g}$ is expressed as simple matrix-vector operations (e.g. a polynomial Defferrard et al. (2016) or rational function Levie et al. (2018)), the eigendecomposition of the Laplacian can be avoided altogether.
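A particularly simple choice is a polynomial spectral filter $\hat{g}(\lambda) = \sum_{k=0}^{r} \theta_k \lambda^k$ of degree $r$, allowing the convolution to be computed entirely in the spatial domain as
$$\mathbf{Y} = \hat{g}(\boldsymbol{\Delta})\mathbf{X} = \sum_{k=0}^{r} \theta_k \boldsymbol{\Delta}^k \mathbf{X}, \tag{1}$$
where $\boldsymbol{\Delta}$ is the normalized graph Laplacian. Note that such a filter has $r+1$ parameters $\theta_0, \ldots, \theta_r$, does not require explicit multiplication by the eigenvector matrix $\boldsymbol{\Phi}$, and has a compact support of $r$ hops in the node domain (due to the fact that $\boldsymbol{\Delta}^r$ affects only neighbors within $r$ hops). Though originating from a spectral construction, the resulting filter is an operation in the node domain amounting to a successive aggregation of features in the neighbor nodes. Moreover, using recursively-defined Chebyshev polynomials $T_k(\lambda) = 2\lambda T_{k-1}(\lambda) - T_{k-2}(\lambda)$ with $T_0(\lambda) = 1$ and $T_1(\lambda) = \lambda$, the computation can be performed with complexity $\mathcal{O}(|\mathcal{E}|\, r)$ for sparsely-connected graphs. Finally, the polynomial filters can be combined with non-linearities, concatenated in multiple layers, and interleaved with pooling layers based on graph coarsening Defferrard et al. (2016).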
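As an illustration of why a polynomial filter needs no eigendecomposition and is localized, the sketch below applies a degree-2 filter using only repeated products with the Laplacian. It uses the plain monomial basis rather than the Chebyshev recursion, and the path graph and coefficients `theta` are made-up example values.

```python
import numpy as np

def polynomial_filter(Delta, X, theta):
    """Compute sum_k theta[k] * Delta^k X using only products Delta @ (...)."""
    out = np.zeros(X.shape, dtype=float)
    power = X.astype(float)           # Delta^0 X
    for k, coeff in enumerate(theta):
        if k > 0:
            power = Delta @ power     # Delta^k X obtained from Delta^{k-1} X
        out += coeff * power
    return out

# Path graph on 5 nodes: 0-1-2-3-4 (hypothetical example data).
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
Delta = np.eye(n) - inv_sqrt[:, None] * W * inv_sqrt[None, :]

X = np.zeros((n, 1))
X[0, 0] = 1.0                          # unit impulse at node 0
theta = [0.5, -1.0, 0.25]              # degree r = 2, arbitrary coefficients
Y = polynomial_filter(Delta, X, theta)
```

Because the filter has degree 2, the response to an impulse at node 0 stays within two hops, so the entries of `Y` at nodes 3 and 4 are zero.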
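GCN. In the case $r = 1$, equation (1) reduces to computing $\theta_0 \mathbf{X} + \theta_1 \boldsymbol{\Delta}\mathbf{X}$, which can be interpreted as a combination of the node features and their diffused version. Kipf and Welling (2017) proposed a model of graph convolutional networks (GCN) combining node-wise and graph diffusion operations:
$$\mathbf{Y} = \mathrm{ReLU}\big(\tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{W}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\boldsymbol{\Theta}\big). \tag{2}$$
Here $\tilde{\mathbf{W}} = \mathbf{W} + \mathbf{I}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}} = \mathrm{diag}\big(\sum_{j} \tilde{w}_{1j}, \ldots, \sum_{j} \tilde{w}_{nj}\big)$ is the respective degree matrix, and $\boldsymbol{\Theta}$ is a matrix of learnable parameters.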
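A single GCN layer of this kind can be sketched in a few lines of NumPy. The sketch assumes non-negative edge weights and a ReLU non-linearity; the triangle graph, feature sizes, and random weights are placeholders for illustration only.

```python
import numpy as np

def gcn_diffusion(W):
    """Symmetrically normalized adjacency with self-loops, D~^{-1/2} (W + I) D~^{-1/2}."""
    W_tilde = W + np.eye(W.shape[0])
    # With non-negative weights, self-loops guarantee every degree is >= 1.
    inv_sqrt = 1.0 / np.sqrt(W_tilde.sum(axis=1))
    return inv_sqrt[:, None] * W_tilde * inv_sqrt[None, :]

def gcn_layer(A, X, Theta):
    """One GCN layer: node-wise linear map, one diffusion step, then ReLU."""
    return np.maximum(A @ X @ Theta, 0.0)

# Triangle graph with 2 input features and 3 output features (toy sizes).
W = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
Theta = rng.normal(size=(2, 3))

A = gcn_diffusion(W)
Y = gcn_layer(A, X, Theta)
```

The diffusion matrix `A` depends only on the graph, not on `Theta`, which is the property that later makes precomputation possible.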
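S-GCN. Stacking $L$ GCN layers with element-wise non-linearity $\sigma$ and a final softmax layer for node classification, it is possible to obtain filters with larger receptive fields on the graph nodes,
$$\mathbf{Y} = \mathrm{softmax}\big(\mathbf{A}\,\sigma(\cdots \sigma(\mathbf{A}\mathbf{X}\boldsymbol{\Theta}^{(1)}) \cdots)\,\boldsymbol{\Theta}^{(L)}\big),$$
where $\mathbf{A} = \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{W}}\tilde{\mathbf{D}}^{-1/2}$ denotes the normalized adjacency matrix with self-loops. Wu et al. (2019) argued that graph convolutions with large filters are practically equivalent to multiple convolutional layers with small filters. They showed that all but the last non-linearities can be removed without harming the performance, resulting in the simplified GCN (S-GCN) model,
$$\mathbf{Y} = \mathrm{softmax}\big(\mathbf{A}^{L}\mathbf{X}\boldsymbol{\Theta}\big). \tag{3}$$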
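The collapse of stacked linear GCN layers into a single S-GCN layer can be checked numerically. The sketch below is only an illustration on a small random graph (all sizes and the random seed are arbitrary): with the intermediate non-linearity removed, two layers with weights `Theta1` and `Theta2` give the same logits as one diffusion by the squared operator followed by the merged weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, hidden, classes = 6, 4, 5, 3

# Random undirected graph, then the GCN operator with self-loops.
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1).astype(float)
W = upper + upper.T
W_tilde = W + np.eye(n)
inv_sqrt = 1.0 / np.sqrt(W_tilde.sum(axis=1))
A = inv_sqrt[:, None] * W_tilde * inv_sqrt[None, :]

X = rng.normal(size=(n, d))
Theta1 = rng.normal(size=(d, hidden))
Theta2 = rng.normal(size=(hidden, classes))

# Two GCN layers with the intermediate non-linearity removed.
stacked_logits = A @ (A @ X @ Theta1) @ Theta2

# S-GCN: diffuse once with A^2 (precomputable), then one merged weight matrix.
A2X = np.linalg.matrix_power(A, 2) @ X
sgcn_logits = A2X @ (Theta1 @ Theta2)

same = np.allclose(stacked_logits, sgcn_logits)
```

Since `A2X` does not involve any learnable parameter, it can be computed once before training.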
A characteristic of many graphs, in particular ‘small-world’ social networks, is the exponential growth of the neighborhood size with the number of hops $L$. In this case, the $L$-hop diffusion matrix $\mathbf{A}^L$ (the $L$-th power of the normalized adjacency matrix $\mathbf{A}$) becomes dense very quickly even for small values of $L$. For Web-scale graphs such as Facebook or Twitter, with their very large numbers of nodes and edges, the diffusion matrix cannot be stored in memory for training. In such a scenario, classic graph convolutional neural network models such as GCN, GAT or MoNet are not applicable. Graph sampling has been shown to be a successful technique to scale GNNs to large graphs, by approximating $\mathbf{A}$ with a matrix $\tilde{\mathbf{A}}$ that has a significantly simpler structure amenable to computation. Generally, graph sampling produces a graph $\tilde{\mathcal{G}} = (\tilde{\mathcal{V}}, \tilde{\mathcal{E}})$ such that $\tilde{\mathcal{V}} \subset \mathcal{V}$ and $\tilde{\mathcal{E}} \subset \mathcal{E}$. We can distinguish between three types of sampling schemes (Figure 2):
Node-wise sampling strategies perform graph convolutions on partial node neighborhoods to reduce computational and memory complexity. For each node of the graph, a selection criterion is employed to sample a fixed number of neighbors involved in the convolution operation and an aggregation tree is constructed out of the extracted nodes.
To overcome memory limitations, node-wise sampling strategies are coupled with minibatch training, where each training step is performed only on a batch of nodes rather than on the whole graph. A training batch is assembled by first choosing ‘optimization’ nodes (marked in orange in Figure 2, left), and partially expanding their corresponding neighborhoods. In a single training step, the loss is computed and optimized only for optimization nodes.
Node-wise sampling coupled with minibatch training was first introduced in GraphSAGE Hamilton et al. (2017b) to address the challenges of scaling GNNs. PinSAGE Ying et al. (2018a) extended GraphSAGE by exploiting a neighbor selection method using scores from approximations of personalized PageRank via random walks. VR-GCN Chen and Zhu (2018) uses control variates to reduce the variance of stochastic training and increase the speed of convergence with a small number of neighbors.
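Node-wise sampling with uniform neighbor selection, in the spirit of GraphSAGE, can be sketched as follows. This is not the GraphSAGE implementation: the function names, the adjacency-list format, and the fan-out values are hypothetical, and the sketch only builds the sampled aggregation tree for a batch of optimization nodes.

```python
import random

def sample_neighbors(adj, node, k, rng):
    """Uniformly sample at most k neighbors of `node`, without replacement."""
    neighbors = adj[node]
    if len(neighbors) <= k:
        return list(neighbors)
    return rng.sample(neighbors, k)

def build_aggregation_tree(adj, batch_nodes, fanouts, rng):
    """Expand the optimization nodes hop by hop.

    Returns one dict per hop, mapping each frontier node to its sampled neighbors.
    """
    layers = []
    frontier = list(dict.fromkeys(batch_nodes))
    for k in fanouts:
        sampled = {v: sample_neighbors(adj, v, k, rng) for v in frontier}
        layers.append(sampled)
        # The next frontier is the set of sampled neighbors (order-preserving).
        frontier = list(dict.fromkeys(u for ns in sampled.values() for u in ns))
    return layers

# Toy adjacency lists (hypothetical example data).
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [0, 3]}
rng = random.Random(0)
layers = build_aggregation_tree(adj, batch_nodes=[0, 3], fanouts=[2, 2], rng=rng)
```

With a fan-out of `k` per hop, the tree for one optimization node can still hold up to `k` to the power of the number of hops nodes, which is the redundancy that layer-wise and graph-wise sampling try to reduce.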
Layer-wise sampling Chen et al. (2018); Huang et al. (2018) avoids over-expansion of neighborhoods to overcome the redundancy of node-wise sampling. Nodes in each layer only have directed edges towards nodes of the next layer, thus bounding the maximum amount of computation per layer. Moreover, sharing common neighbors prevents feature replication across the batch, drastically reducing the memory complexity during training.
Graph-wise sampling Chiang et al. (2019); Zeng et al. (2019) further advances feature sharing: each batch consists of a connected subgraph and at each training iteration the GNN model is optimized over all nodes in the subgraph. In ClusterGCN Chiang et al. (2019), non-overlapping clusters are computed as a pre-processing step and then sampled during training as input minibatches. GraphSAINT Zeng et al. (2019) adopts a similar approach, while also correcting for the bias and variance of the minibatch estimators when sampling subgraphs for training. It also explores different schemes to sample the subgraphs, such as a random walk-based sampler, which is able to co-sample nodes having high influence on each other and guarantees each edge has a non-negligible probability of being sampled. At the time of writing, GraphSAINT is the state-of-the-art method for large graphs.
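A rough sketch of graph-wise minibatching in the ClusterGCN style is given below. The clustering itself is assumed to be precomputed by some graph clustering routine (the `clusters` list is made up here), and the helper names are hypothetical; the sketch only shows how clusters are grouped into batches and how edges leaving the batch are dropped.

```python
import random

def induced_subgraph(edges, nodes):
    """Keep only the edges whose two endpoints both belong to `nodes`."""
    keep = set(nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]

def cluster_batches(clusters, clusters_per_batch, rng):
    """Yield node lists obtained by merging randomly chosen clusters."""
    order = list(range(len(clusters)))
    rng.shuffle(order)
    for start in range(0, len(order), clusters_per_batch):
        chosen = order[start:start + clusters_per_batch]
        yield sorted(v for c in chosen for v in clusters[c])

# Toy graph and a hypothetical precomputed clustering.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (1, 4)]
clusters = [[0, 1], [2, 3], [4, 5]]
rng = random.Random(0)

batches = [(nodes, induced_subgraph(edges, nodes))
           for nodes in cluster_batches(clusters, 2, rng)]
```

Every node appears in exactly one batch per epoch, and diffusion inside a batch only uses the edges of the induced subgraph.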
In this work we propose SIGN, an alternative method to scale graph neural networks to extremely large graphs. SIGN is not based on sampling nodes or subgraphs as these operations introduce bias into the optimization procedure.
We take inspiration from two recent findings: (i) despite its simplicity, the S-GCN (3) model appears to be extremely efficient and to attain similar results to models with multiple stacked convolutional layers Wu et al. (2019); (ii) GCN aggregation schemes (2) have been essentially shown to learn low-pass filters NT and Maehara (2019) while still performing on par with models with more complex aggregation functions in the task of semi-supervised node classification Shchur et al. (2018b).
Accordingly, we propose the following architecture for node-wise classification:
$$\mathbf{Z} = \sigma\big(\big[\,\mathbf{X}\boldsymbol{\Theta}_0 \,\|\, \mathbf{A}\mathbf{X}\boldsymbol{\Theta}_1 \,\|\, \ldots \,\|\, \mathbf{A}^{r}\mathbf{X}\boldsymbol{\Theta}_r\,\big]\big), \qquad \mathbf{Y} = \xi(\mathbf{Z}\boldsymbol{\Omega}), \tag{4}$$
where $\mathbf{A}$ is a fixed $n \times n$ graph diffusion matrix, $\boldsymbol{\Theta}_0, \ldots, \boldsymbol{\Theta}_r$, $\boldsymbol{\Omega}$ are learnable matrices respectively of dimensions $d \times d'$ and $d'(r+1) \times c$ for $c$ classes, $\|$ is the concatenation operation and $\sigma$, $\xi$ are non-linearities, the second one computing class probabilities, e.g. via softmax or sigmoid function, depending on the task at hand. Note that the model in equation (4) is analogous to the popular Inception module Szegedy et al. (2015) for classic CNN architectures (Figure 2): it consists of convolutional filters of different sizes determined by the parameter $r$, where $r = 0$ corresponds to $1 \times 1$ convolutions in the inception module (amounting to linear transformations of the features in each node without diffusion across nodes). Owing to this analogy, we refer to our model as the Scalable Inception Graph Network (SIGN). We note that the idea of an Inception module has been extended to GNNs in Kazi et al. (2019); in that work, however, the authors do not discuss the inclusion of a linear, non-diffusive term ($r = 0$), which effectively accounts for a skip connection. Additionally, the focus there is not on scaling the model to large graphs, but rather on capturing intra- and inter-graph structural heterogeneity.
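The forward pass of the architecture described above can be sketched in NumPy as follows, under the assumption that the model concatenates the branches $\mathbf{A}^k\mathbf{X}\boldsymbol{\Theta}_k$ for $k = 0, \ldots, r$, applies a PReLU, and then a softmax classifier. The function names, the ring graph, the sizes, and the PReLU slope of 0.25 are illustrative choices, not the authors' code.

```python
import numpy as np

def prelu(x, a):
    """PReLU: identity for positive inputs, slope `a` for negative inputs."""
    return np.maximum(x, 0.0) + a * np.minimum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sign_forward(A, X, Thetas, Omega, a=0.25):
    """Inception-style layer: one branch per power of A, concatenated, then classified."""
    branches = []
    AkX = X                                 # A^0 X
    for k, Theta in enumerate(Thetas):
        if k > 0:
            AkX = A @ AkX                   # A^k X, independent of the parameters
        branches.append(AkX @ Theta)
    Z = prelu(np.concatenate(branches, axis=1), a)
    return softmax(Z @ Omega)

# Toy setup: ring graph with self-loops, r = 2 (all sizes are arbitrary).
rng = np.random.default_rng(0)
n, d, d_hidden, c, r = 8, 5, 4, 3, 2
ring = np.eye(n) + np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
A = ring / 3.0
X = rng.normal(size=(n, d))
Thetas = [rng.normal(size=(d, d_hidden)) for _ in range(r + 1)]
Omega = rng.normal(size=(d_hidden * (r + 1), c))

probs = sign_forward(A, X, Thetas, Omega)
```

The first branch (`k = 0`) never touches `A`, which is the non-diffusive term playing the role of a skip connection.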
Generalization of other models. It is also easy to observe that various graph convolutional layers can be obtained as particular settings of (4). In particular, by setting the non-linearity $\sigma$ to PReLU, that is
$$\sigma(x) = \max(0, x) + a \min(0, x),$$
where $a$ is a learnable parameter, ChebNet, GCN, and S-GCN can be automatically learnt if a suitable diffusion operator $\mathbf{A}$ and activation $\xi$ are used (see Table 1).
|ChebNet Defferrard et al. (2016)|
|GCN Kipf and Welling (2017)|
|S-GCN Wu et al. (2019)||1|
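As a worked example of one such special case, the derivation below shows a setting under which the inception layer reduces to S-GCN. It assumes the model has the form $\mathbf{Z} = \sigma([\mathbf{X}\boldsymbol{\Theta}_0 \| \ldots \| \mathbf{A}^r\mathbf{X}\boldsymbol{\Theta}_r])$, $\mathbf{Y} = \xi(\mathbf{Z}\boldsymbol{\Omega})$ with PReLU slope $a$; the block $\boldsymbol{\Omega}_r$ (the last $d'$ rows of $\boldsymbol{\Omega}$) is notation introduced here for illustration, and this is one possible setting rather than necessarily the one listed in Table 1.

```latex
\begin{align*}
a = 1 \;&\Rightarrow\; \sigma(x) = \max(0, x) + \min(0, x) = x, \\
\boldsymbol{\Theta}_0 = \ldots = \boldsymbol{\Theta}_{r-1} = \mathbf{0}
  \;&\Rightarrow\; \mathbf{Z} = \big[\, \mathbf{0} \,\|\, \ldots \,\|\, \mathbf{0} \,\|\, \mathbf{A}^{r}\mathbf{X}\boldsymbol{\Theta}_r \,\big], \\
\xi = \mathrm{softmax}
  \;&\Rightarrow\; \mathbf{Y} = \mathrm{softmax}(\mathbf{Z}\boldsymbol{\Omega})
  = \mathrm{softmax}\big(\mathbf{A}^{r}\mathbf{X}\,\boldsymbol{\Theta}_r\boldsymbol{\Omega}_r\big).
\end{align*}
```

With $L = r$ and $\boldsymbol{\Theta} = \boldsymbol{\Theta}_r\boldsymbol{\Omega}_r$, the last line has the same form as the S-GCN model $\mathrm{softmax}(\mathbf{A}^{L}\mathbf{X}\boldsymbol{\Theta})$.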
Efficient computation. Finally and most importantly, we observe that the matrix products $\mathbf{A}\mathbf{X}, \ldots, \mathbf{A}^{r}\mathbf{X}$ in equation (4) do not depend on the learnable model parameters and can be easily precomputed. For large graphs, distributed computing infrastructures such as Apache Spark can speed up computation. This effectively reduces the computational complexity of the overall model to that of a multi-layer perceptron, i.e. $\mathcal{O}(L_{ff}\, n\, d^2)$, where $d$ is the number of features, $n$ the number of nodes in the training/testing graph and $L_{ff}$ is the overall number of feed-forward layers in the model.
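The precomputation step can be sketched as follows. The helper names and the toy ring graph are assumptions made for illustration; the point is that the diffused features are computed once, after which each minibatch is just a set of rows, exactly as for an ordinary multi-layer perceptron.

```python
import numpy as np

def precompute_diffused_features(A, X, r):
    """Return [X | A X | ... | A^r X], computed once before training."""
    feats = [X]
    for _ in range(r):
        feats.append(A @ feats[-1])        # one extra hop of diffusion
    return np.concatenate(feats, axis=1)   # shape (n, (r + 1) * d)

def minibatches(n, batch_size, rng):
    """Yield index arrays that cover all n nodes once, in random order."""
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

# Toy ring graph with self-loops (hypothetical example data).
rng = np.random.default_rng(0)
n, d, r = 8, 3, 2
A = (np.eye(n) + np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)) / 3.0
X = rng.normal(size=(n, d))

F = precompute_diffused_features(A, X, r)

# During training no graph operation is needed: a batch is a plain row slice.
batches = [F[idx] for idx in minibatches(n, batch_size=3, rng=rng)]
```

In practice `A` would be stored as a sparse matrix so that each product costs time proportional to the number of edges; the dense matrix here is only for brevity.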
Table 2 compares the complexity of our SIGN model to other scalable architectures GraphSAGE, ClusterGCN, and GraphSAINT.
|GraphSAGE Hamilton et al. (2017b)|
|ClusterGCN Chiang et al. (2019)|
|GraphSAINT Zeng et al. (2019)|
is the number of sampled neighbors per node. The forward pass complexity corresponds to an entire epoch where all nodes are seen.
Our method is applicable to both transductive and inductive learning. In the inductive setting, test and validation nodes are held out at training time, i.e. training nodes are only connected to other training nodes. During model evaluation, on the contrary, all the original edges are considered. In the transductive (semi-supervised) setting, all the nodes are seen at training time, even though only a small fraction of the nodes have training labels.
Inductive experiments are performed using four large public datasets: Reddit, Flickr, Yelp and PPI. Introduced in Hamilton et al. (2017b), Reddit is the gold standard benchmark for GNNs on large-scale graphs. Flickr and Yelp were introduced in Zeng et al. (2019) and PPI in Zitnik and Leskovec (2017). In agreement with Zeng et al. (2019), we confirm that the performance of a variety of models on the last three datasets is unstable, meaning that large variations in the results are observed for very small changes in architecture and optimization parameters. We hypothesize that this is due to errors in the data, or to unfortunate a priori choices of the training, validation, and test splits. We still report results on these datasets for the purpose of comparing with the work in Zeng et al. (2019). Amongst the considered inductive datasets, Reddit and Flickr are multiclass node-wise classification problems: in the former, the task is to predict communities of online posts based on user comments; in the latter, the task is image categorization based on the description and common properties of online images. Yelp and PPI are multilabel classification problems: the objective of the former is to predict business attributes based on customer reviews, while the latter task consists of predicting protein functions from the interactions of human tissue proteins.
While our focus is on large graphs, we also experiment with smaller, but well-established transductive datasets to compare SIGN to traditional GNN methods: Amazon-Computers Shchur et al. (2018a), Amazon-Photos Shchur et al. (2018a), and Coauthor-CS Shchur et al. (2018a). These datasets are used in the most recent state-of-the-art methods evaluation presented in Klicpera et al. (2019). Following Klicpera et al. (2019), we use 20 training nodes per class; 1500 validation nodes are used for Amazon-Photos and Amazon-Computers and 5000 for Coauthor-CS. Dataset statistics are reported in Tables 4 and 5.
For all datasets we use an inception convolutional layer with $r = 2$, PReLU activation He et al. (2015) and a fixed diffusion operator $\mathbf{A}$. To allow for larger model capacity in the inception module and in computing final model predictions, we replace the single-layer projections performed by parameters $\boldsymbol{\Theta}_0, \ldots, \boldsymbol{\Theta}_r$ and $\boldsymbol{\Omega}$ with multiple feedforward layers. Model outputs for multiclass classification problems were normalized via softmax; for the multilabel classification tasks we use element-wise sigmoid functions. Model parameters are found by minimizing the cross-entropy loss via minibatch gradient descent with the Adam optimizer Kingma and Ba (2014) and early stopping, i.e. we stop training if the validation loss does not decrease for a fixed number of consecutive evaluation phases. In order to limit overfitting, we apply the standard regularization techniques of weight decay and dropout Srivastava et al. (2014). Additionally, batch normalization Ioffe and Szegedy (2015) has been used in every layer to stabilize training and increase convergence speed. Model hyperparameters (weight decay coefficient, dropout rate, hidden layer sizes, batch size, learning rate, number of feedforward layers in the inception module, number of feedforward layers for the final classification) are optimized on the validation sets using Bayesian optimization with a tree Parzen estimator surrogate function Bergstra et al. (2011). Table 3 shows the hyperparameter ranges defining the search space.
Table 3: Hyperparameter ranges defining the search space.
| |Learning Rate|Batch Size|Dropout|Weight Decay|Hidden Dimensions|Inception FF|Classification FF|
|Range|[0.00001, 0.01]|[32, 2048]|[0, 1]|[0, 0.01]|[16, 1000]|[1, 2]|[1, 2]|
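The early-stopping rule used during training (stop once the validation loss has not decreased for a given number of consecutive evaluation phases) can be sketched as below. The function name, the list of validation losses, and the patience value of 3 are hypothetical and only serve to illustrate the rule.

```python
def early_stopping_point(val_losses, patience):
    """Return (best_step, stop_step) for a sequence of validation losses.

    Training stops at the first evaluation where the loss has failed to improve
    on the best value seen so far for `patience` consecutive evaluations.
    """
    best_loss = float("inf")
    best_step = -1
    bad_evals = 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, len(val_losses) - 1

# Made-up validation curve: the best value is reached at evaluation 1.
losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
best_step, stop_step = early_stopping_point(losses, patience=3)
```

Here training stops at evaluation 4 and the parameters from evaluation 1 would be kept; the later improvement at evaluation 5 is never seen, which is the usual trade-off of a small patience value.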
|Dataset|Nodes|Edges|Avg. Degree|Features|Classes|Train / Val / Test|
|Reddit|232,965|11,606,919|50|602|41(s)|0.66 / 0.10 / 0.24|
|Yelp|716,847|6,977,410|10|300|100(m)|0.75 / 0.10 / 0.15|
|Flickr|89,250|899,756|10|500|7(s)|0.50 / 0.25 / 0.25|
|PPI|14,755|225,270|15|50|121(m)|0.66 / 0.12 / 0.22|
|Avg. Degree||Classes||Label rate|
Table 6: Mean and standard deviation of preprocessing and inference time in seconds on Reddit computed over 10 runs.
|Method|Preprocessing (s)|Inference (s)|
|ClusterGCN|415.29 ± 5.83|9.23 ± 0.10|
|GraphSAINT|34.29 ± 0.06|3.47 ± 0.03|
|SIGN (Ours)|234.27 ± 3.79|0.17 ± 0.00|
Baselines. On the large scale inductive datasets, we compare our method to GCN Kipf and Welling (2017), FastGCN Chen et al. (2018), Stochastic-GCN Chen and Zhu (2018), AS-GCN Huang et al. (2018), GraphSAGE Hamilton et al. (2017b), ClusterGCN Chiang et al. (2019), and GraphSAINT Zeng et al. (2019), which constitute the current state-of-the-art. To make the comparison fair, all models have 2 graph convolutional layers. The results for the baselines are reported from Zeng et al. (2019). On the smaller transductive datasets, we compare to the well established methods GCN Kipf and Welling (2017), GAT Velickovic et al. (2018), JK Xu et al. (2018), GIN Xu et al. (2019), ARMA Bianchi et al. (2019b), and the current state-of-the-art DIGL Klicpera et al. (2019).
Inductive. Table 8 presents the results on the inductive datasets. In line with Zeng et al. (2019), we report the micro-averaged F1 score means and standard deviations computed over 10 runs. Here we match the state-of-the-art accuracy on Reddit, while consistently performing competitively on other datasets.
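For reference, the micro-averaged F1 score pools true positives, false positives, and false negatives over all nodes and labels before computing a single F1 value. The sketch below assumes 0/1 indicator rows as input; the function name and the example labels are made up.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over rows of 0/1 label indicators."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Multilabel example: 2 nodes, 3 labels (made-up data).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]
score = micro_f1(y_true, y_pred)   # tp = 2, fp = 1, fn = 1
```

For single-label multiclass problems encoded as one-hot rows, every misclassified node contributes one false positive and one false negative, so the micro-averaged F1 coincides with accuracy.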
Runtime. While being comparable in terms of accuracy, our method has the advantage of being significantly faster than other methods for large graphs, both in training and inference. In Figure 4, we plot the validation F1 score on Reddit from the start of the training as a function of runtime. SIGN converges faster than both of the other methods, while also converging to a much better F1 score than ClusterGCN. We do not report runtime results for GraphSAGE as it is substantially slower than other methods Chiang et al. (2019).
In Table 6, we report the preprocessing time needed to extract the data and prepare it for training, as well as the inference time on the entire test set for Reddit. For this experiment, we used the authors' published code. While being slower than GraphSAINT in the preprocessing phase, SIGN takes a fraction of the time for inference, outperforming GraphSAINT and ClusterGCN by more than an order of magnitude. It is important to note that while our implementation is in PyTorch, the implementations of GraphSAINT (https://github.com/GraphSAINT/GraphSAINT) and ClusterGCN (https://github.com/google-research/google-research/tree/master/cluster_gcn) are in TensorFlow, which, according to Chiang et al. (2019), is up to six times faster than PyTorch. Moreover, while GraphSAINT’s preprocessing is parallelized, ours is not. Aiming at a further performance speedup, a TensorFlow implementation of our model, together with parallelization of the preprocessing routines, is left as future work.
Transductive. To further validate our method, we compare it to classic, as well as state-of-the-art, (non-scalable) GNN methods on well established small scale benchmarks.
Table 7 presents the results on these transductive datasets, averaged over 100 different train/val/test splits. While our focus is on node-wise classification on large-scale graphs, we show that SIGN is also competitive on smaller well-established transductive benchmarks, outperforming classical methods and getting close to the current state-of-the-art method (DIGL). This suggests that, while being scalable and fast (and therefore well suited to large-scale applications), it can also be effective for problems involving modest-sized graphs.
Effect of convolution size $r$. We perform a sensitivity analysis on the power parameter $r$, defining the size of the largest convolutional filter in the inception layer. On Reddit, $r = 2$ works best and we keep this configuration on all datasets. Figure 4 depicts the convergence test F1 scores as a function of $r$. It is interesting to see that while the model with $r = 3$ is a generalization of the model with $r = 2$, increasing $r$ is actually detrimental in this case. We hypothesize this is due to the features aggregated from the 3-hop neighborhood of a node not being informative, but actually misleading for the model.
Table 7: Results on the transductive datasets (mean ± standard deviation).
|Method|Amazon-Computers|Amazon-Photos|Coauthor-CS|
|GCN Kipf and Welling (2017)|84.75 ± 0.23|92.08 ± 0.20|91.83 ± 0.08|
|GAT Velickovic et al. (2018)|45.37 ± 4.20|53.40 ± 5.49|90.89 ± 0.13|
|JK Xu et al. (2018)|83.33 ± 0.27|91.07 ± 0.26|91.11 ± 0.09|
|GIN Xu et al. (2019)|55.44 ± 0.83|68.34 ± 1.16|-|
|ARMA Bianchi et al. (2019b)|84.36 ± 0.26|91.41 ± 0.22|91.32 ± 0.08|
|DIGL Klicpera et al. (2019)|86.67 ± 0.21|92.93 ± 0.21|92.97 ± 0.07|
|SIGN (Ours)|85.93 ± 1.21|91.72 ± 1.20|91.98 ± 0.50|
Table 8: Micro-averaged F1 scores on the inductive datasets (mean ± standard deviation over 10 runs).
|Method|Reddit|Flickr|PPI|Yelp|
|GCN Kipf and Welling (2017)|0.933 ± 0.000|0.492 ± 0.003|0.515 ± 0.006|0.378 ± 0.001|
|FastGCN Chen et al. (2018)|0.924 ± 0.001|0.504 ± 0.001|0.513 ± 0.032|0.265 ± 0.053|
|Stochastic-GCN Chen and Zhu (2018)|0.964 ± 0.001|0.482 ± 0.003|0.963 ± 0.010|0.640 ± 0.002|
|AS-GCN Huang et al. (2018)|0.958 ± 0.001|0.504 ± 0.002|0.687 ± 0.012|-|
|GraphSAGE Hamilton et al. (2017b)|0.953 ± 0.001|0.501 ± 0.013|0.637 ± 0.006|0.634 ± 0.006|
|ClusterGCN Chiang et al. (2019)|0.954 ± 0.001|0.481 ± 0.005|0.875 ± 0.004|0.609 ± 0.005|
|GraphSAINT Zeng et al. (2019)|0.966 ± 0.001|0.511 ± 0.001|0.981 ± 0.004|0.653 ± 0.003|
|SIGN (Ours)|0.966 ± 0.003|0.503 ± 0.003|0.965 ± 0.002|0.623 ± 0.005|
Our results are consistent with previous reports Shchur et al. (2018b) advocating in favor of simple architectures (with just a single graph convolutional layer) in graph learning tasks. Our architecture achieves a good trade-off between simplicity, allowing efficient and scalable applications to very large graphs, and expressiveness, achieving competitive performance in a variety of applications. For this reason, SIGN is well suited to industrial large-scale systems. Our architecture achieves competitive results on common graph learning benchmarks, while being significantly faster in training and up to two orders of magnitude faster in inference than other scalable approaches.
Extensions. Though in this paper we applied our model to the supervised node-wise classification setting, it is generic and can also be used for graph-wise classification tasks and unsupervised representation learning (e.g. graph autoencoders). The latter is a particularly important setting in recommender systems Berg et al. (2018).
While we focused our discussion on undirected graphs for the sake of simplicity, our model is straightforwardly applicable to directed graphs, in which case $\mathbf{A}$ is a non-symmetric diffusion operator. Furthermore, more complex aggregation operations, e.g. higher-order Laplacians Barbarossa and Sardellitti (2019) or directed diffusion based on graph motifs Monti et al. (2018), can be straightforwardly incorporated as additional channels in the inception module.
Finally, while our method relies on linear graph aggregation operations of the form $\mathbf{A}\mathbf{X}$ for efficient precomputation, it is possible to make the diffusion operator dependent on the node features (and edge features, if available) as a matrix of the form $\mathbf{A}(\mathbf{X})$.
Limitations. Graph attention Velickovic et al. (2018) and similar mechanisms Monti et al. (2017) require a more elaborate parametric aggregation operator of the form $\mathbf{A}_{\theta}(\mathbf{X})$, where $\theta$ are learnable parameters. This precludes efficient precomputation, which is key to the efficiency of our approach. Attention can be implemented in our scheme by training on a small subset of the graph to first determine the attention parameters, then fixing them to precompute the diffusion operator that is used during training and inference. For the same reason, our approach readily accommodates only a single graph convolutional layer, though the architecture supports multiple linear layers. Architectures with many convolutional layers are achievable by layer-wise training.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).