KerGNNs: Interpretable Graph Neural Networks with Graph Kernels

by Aosong Feng, et al.
Yale University

Graph kernels are historically the most widely-used technique for graph classification tasks. However, these methods suffer from limited performance because of the hand-crafted combinatorial features of graphs. In recent years, graph neural networks (GNNs) have become the state-of-the-art method for downstream graph-related tasks due to their superior performance. Most GNNs are based on the message passing neural network (MPNN) framework. However, recent studies show that MPNNs cannot exceed the power of the Weisfeiler-Lehman (WL) algorithm in the graph isomorphism test. To address the limitations of existing graph kernel and GNN methods, in this paper we propose a novel GNN framework, termed Kernel Graph Neural Networks (KerGNNs), which integrates graph kernels into the message passing process of GNNs. Inspired by convolution filters in convolutional neural networks (CNNs), KerGNNs adopt trainable hidden graphs as graph filters, which are combined with subgraphs to update node embeddings using graph kernels. In addition, we show that MPNNs can be viewed as special cases of KerGNNs. We apply KerGNNs to multiple graph-related tasks and use cross-validation to make fair comparisons with benchmarks. We show that our method achieves competitive performance compared with existing state-of-the-art methods, demonstrating the potential to increase the representation ability of GNNs. We also show that the trained graph filters in KerGNNs can reveal the local graph structures of the dataset, which significantly improves model interpretability compared with conventional GNN models.






Related Work


Several works have been devoted to improving the expressivity of GNNs by introducing spatial, hierarchical, and higher-order GNN variants. For example, abu2019mixhop proposed the mix-hop structure, which can learn a more general class of neighborhood mixing relationships. sato2019approximation proposed to use Consistent Port Numbering GNNs to augment the neighborhood aggregation, but port orderings are not unique and different orderings may lead to different expressivity. klicpera2020directional leveraged the atom coordinate information in molecular graphs to improve expressivity, but the notion of direction is hard to generalize to more general graphs. nguyen2020graph used graph homomorphism numbers as updated embeddings and showed the expressivity of such graph classifiers with a universality property, which unfortunately lacks a neural network structure. Higher-order GNN variants, which are more powerful than the 1-WL graph isomorphism test, have been studied in morris2019weisfeiler and maron2019provably. However, higher-order methods always involve heavy computation, and KerGNNs introduce a different way to break this 1-WL limit.

Combination of Graph Kernel and GNNs

Graph kernels and GNNs can be combined in the same framework. Some works apply graph kernels and neural networks at different stages (navarin2018pre; nikolentzos2018kernel). There are also works that use GNN architectures to design new kernels. For example, du2019graph proposed a graph kernel equivalent to infinitely wide GNNs trained using gradient descent.

A different line of research focuses on integrating kernel methods into GNNs. lei2017deriving mapped inputs to an RKHS by comparing them with reference objects. However, the reference objects they use lack graph structure and may fail to capture structural information. chen2020convolutional proposed GCKN, which maps the input into a subspace of the RKHS of a walk or path kernel. While GCKN utilizes only the local walks and paths starting from the central node, our model considers any walks (up to a maximal length) within the subgraph around the central node, and can thus explore more topological structures. Another recent work, rwgnn, focused on improving model transparency by computing graph kernels between trainable hidden graphs and the entire input graph. However, that method only supports a single-layer model and lacks theoretical interpretation. Our KerGNN model generalizes this scenario by applying hidden graphs to extract local structural information instead of processing the entire graph, and therefore supports a multi-layer structure with better graph classification performance.


Both graph structures and feature information lead to complex GNN models, making human-intelligible explanations of the prediction results difficult. The transparency and explainability of GNN models are therefore important issues to address. baldassarre2019explainability compared two main classes of explainability methods on infection and solubility problems. pope2019explainability introduced explainability methods for the popular graph convolutional networks and demonstrated the extended methods on visual scene graphs and molecular graphs. ying2019gnnexplainer proposed a model-agnostic approach that can identify a compact subgraph that plays a crucial role in a GNN's prediction. In addition to visualizing output graphs as regular GNNs do, our KerGNN provides trained hidden graphs as a byproduct of training, without additional computation; these contain useful structural information reflecting the common characteristics of the whole dataset instead of one specific graph, and can be helpful for interpreting the predictions of GNNs.

Background: Graph Kernels

Graph kernels have been proposed to solve the problem of assessing the similarity between graphs, thereby making it possible to perform classification and regression with graph-structured data. Most graph kernels can be written as the sum of several pair-wise base kernels, following the $R$-convolution framework (haussler1999convolution):

$$K(G_1, G_2) = \sum_{v_1 \in V_1} \sum_{v_2 \in V_2} \kappa(v_1, v_2), \qquad (1)$$

where $G_1 = (V_1, E_1)$, $G_2 = (V_2, E_2)$ are two input graphs with node attributes, and the base kernel $\kappa$ can be any positive definite kernel defined on the node attributes. In this paper, we mainly consider the random walk kernel, which will be integrated into our proposed model in the next section.

Random walk kernels are among the most studied graph kernels. They count the number of walks that two graphs have in common, and were initially proposed by gartner2003graph and kashima2003marginalized. Among the numerous variations of the random walk kernel, we deploy the $P$-step random walk kernel, which compares random walks up to length $P$ in the two graphs.

Following Equation 1, we can write the base kernel comparing random walks up to length $P$ as

$$\kappa(v_1, v_2) = \sum_{p=0}^{P} \lambda_p\, k^{(p)}(v_1, v_2), \qquad k^{(p)}(v_1, v_2) = \kappa_0(v_1, v_2) \sum_{u_1 \in \mathcal{N}(v_1)} \sum_{u_2 \in \mathcal{N}(v_2)} k^{(p-1)}(u_1, u_2),$$

where $\lambda_p$ is the coefficient of the $p$-th term, $\mathcal{N}(v)$ denotes the neighbors of $v$, $p$ denotes the length of the random walks which we compare in the two graphs, and $k^{(0)} = \kappa_0$ is a kernel on node labels or attributes. If $P = 0$, the random walk kernel is equivalent to the simple node-pair kernel. To efficiently compute random walk kernels, we follow the generalized framework for computing walk-based kernels (vishwanathan2006fast) and utilize the direct product graph defined below.
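To make the $P$-step random walk kernel concrete, the following is a toy pure-Python implementation for graphs with discrete labels, using a delta base kernel; the adjacency-list format and function names are our own illustration, not the paper's code:

```python
def common_walks(adj1, labels1, adj2, labels2, p):
    """k^(p)(v1, v2): number of label-matching common walks of length p
    starting at the node pair (v1, v2). Returns the full matrix."""
    n1, n2 = len(labels1), len(labels2)
    # p = 0: the simple node-pair (delta) kernel
    k = [[1.0 if labels1[i] == labels2[j] else 0.0 for j in range(n2)]
         for i in range(n1)]
    for _ in range(p):
        nxt = [[0.0] * n2 for _ in range(n1)]
        for v1 in range(n1):
            for v2 in range(n2):
                if labels1[v1] != labels2[v2]:
                    continue  # walks must match labels at every step
                # simultaneous step to every pair of neighbors
                nxt[v1][v2] = sum(k[u1][u2]
                                  for u1 in adj1[v1] for u2 in adj2[v2])
        k = nxt
    return k

def random_walk_kernel(adj1, labels1, adj2, labels2, P, lam):
    """K(G1, G2) = sum_p lam[p] * sum over (v1, v2) of k^(p)(v1, v2)."""
    return sum(lam[p] * sum(map(sum, common_walks(adj1, labels1,
                                                  adj2, labels2, p)))
               for p in range(P + 1))
```

For example, comparing a uniformly labeled triangle with itself with $P = 1$ and unit weights counts the 9 matching node pairs plus the 36 matching walks of length one.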

Definition 1 (Direct Product Graph). For two labeled graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, the direct product graph $G_\times = (V_\times, E_\times)$ is defined as $V_\times = \{(v_1, v_2) \in V_1 \times V_2 : l(v_1) = l(v_2)\}$ and $E_\times = \{((v_1, v_2), (u_1, u_2)) : (v_1, u_1) \in E_1 \text{ and } (v_2, u_2) \in E_2\}$.

Performing a random walk on the direct product graph is equivalent to performing simultaneous random walks on graphs $G_1$ and $G_2$. The $P$-step random walk kernel can be calculated as

$$K(G_1, G_2) = \sum_{i,j=1}^{|V_\times|} \Big[ \sum_{p=0}^{P} \lambda_p A_\times^p \Big]_{ij}, \qquad (2)$$

where $A_\times$ is the adjacency matrix of $G_\times$ and $\lambda_0, \dots, \lambda_P$ is a sequence of weights. It should be noted that the $(i, j)$-th element of $A_\times^p$ (i.e., $A_\times$ to the power of $p$) represents the number of common walks of length $p$ between the $i$-th and $j$-th nodes in $G_\times$.

To generalize the above formula to the continuous and multi-dimensional scenario, we first define the vertex attributes of the direct product graph $G_\times$ (now taken over all node pairs $V_1 \times V_2$). Given the node attribute matrix $X \in \mathbb{R}^{n \times d}$ for a graph with $n$ nodes, where each node attribute is of dimension $d$, the pairwise similarity matrix of the direct product graph is calculated as $S = X_1 X_2^\top \in \mathbb{R}^{n_1 \times n_2}$, where $X_1$ and $X_2$ are the node attribute matrices of $G_1$ and $G_2$, respectively. The $(i, j)$-th element of $S$ encodes the similarity between the $i$-th node of $G_1$ and the $j$-th node of $G_2$. We flatten $S$ into the vector $s = \mathrm{vec}(S)$ for ease of notation, and then integrate the encoded pair-wise similarity into Equation 2:

$$K(G_1, G_2) = \sum_{p=0}^{P} \lambda_p\, s^\top A_\times^p\, s. \qquad (3)$$

Based on this equation, we can calculate the kernel value between two input graphs using the similarity of common walks as the metric. The details of calculating Equation 3 are included in the Appendix.
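Equation 3 can be evaluated directly by materializing the (dense) direct product graph. The sketch below is our own illustration and deliberately naive; the product graph has $n_1 n_2$ nodes, so this is only practical for small graphs (the Appendix describes the efficient route):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rw_kernel_attr(A1, X1, A2, X2, P, lam):
    """sum_p lam[p] * s^T (A_x)^p s, where s flattens the pairwise
    attribute-similarity matrix S with S[i][j] = <x1_i, x2_j>."""
    n1, n2 = len(A1), len(A2)
    s = [dot(x1, x2) for x1 in X1 for x2 in X2]  # vec(S), pair (i, j)
    # adjacency of the direct product graph: Kronecker product
    Ax = [[A1[i][ip] * A2[j][jp] for ip in range(n1) for jp in range(n2)]
          for i in range(n1) for j in range(n2)]
    total, v = 0.0, s[:]
    for p in range(P + 1):
        total += lam[p] * dot(s, v)              # s^T Ax^p s
        v = [dot(row, v) for row in Ax]          # advance one walk step
    return total
```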

In practice, we also consider a slight variation of Equation 1 obtained by adding trainable weights to each base kernel term, which we call the deep random walk kernel:

$$K_{\mathrm{deep}}(G_1, G_2) = \sum_{p=0}^{P} w_p\, \lambda_p\, s^\top A_\times^p\, s, \qquad (4)$$

where $w_p$ represents the trainable weight assigned to the $p$-th base kernel term.

Proposed Model

In this section, we first discuss the framework of the proposed KerGNN model. Then we introduce the concept of subgraph-based neighborhood aggregation and use it to analyze the expressivity of KerGNNs. Next, we show that KerGNNs are inspired by CNNs and compare the two from the kernel perspective. Finally, we argue that KerGNNs generalize the MPNN architecture and analyze the time complexity.

KerGNN Framework

In this subsection, we introduce the KerGNN model, which updates each node's embedding according to the subgraph centered at this node instead of the rooted subtree patterns used in MPNNs, as shown in Figure 1(b). Unless otherwise specified, we refer to the subgraph as the vertex-induced subgraph formed from a node and all of its 1-hop neighbors.

We first define the embeddings of nodes and subgraphs, i.e., mapping functions from nodes and from graphs to feature spaces.

Definition 2 (Feature mapping). Given a graph $G = (V, E)$, a node feature mapping is a node-wise function $\phi : V \to \mathbb{R}^d$, which maps every node to a point in $\mathbb{R}^d$; $\phi(v)$ is called the feature map for node $v$. A graph feature mapping is a function $\varphi : \mathcal{G} \to \mathcal{H}$, where $\mathcal{G}$ is the set of graphs and $\mathcal{H}$ is a (possibly high-dimensional) feature space; $\varphi(G)$ is called the feature map for graph $G$.

For an $L$-layer neural network, we call the input layer the $0$-th layer. At each hidden layer $l$, the input to this layer is an undirected graph $G = (V, E)$, and each node $v$ has a feature map $\phi^{(l)}(v) \in \mathbb{R}^{d_l}$. The output of layer $l$ is the same graph $G$, because we do not consider graph pooling here, and each node in the output graph has a feature map $\phi^{(l+1)}(v) \in \mathbb{R}^{d_{l+1}}$. For example, $G$ can be a graph in the dataset, and $\phi^{(0)}(v)$ is the node attribute vector with dimension $d_0$. For graphs with discrete node labels, the attributes can be represented as one-hot encodings of the labels, and the dimension of the attributes corresponds to the total number of label classes. For graphs without node labels, we use the degree of each node as its attribute.

Inspired by the filters in CNNs, we define a set of graph filters at each KerGNN layer to extract the local structural information around each node in the input graph (see Figure 1(b)).

Definition 3 (Graph filter). The $j$-th graph filter at layer $l$, denoted $W_j^{(l)}$, is a graph with $c$ nodes. It has a trainable adjacency matrix $A_j^{(l)} \in \mathbb{R}^{c \times c}$ and a trainable node attribute matrix $M_j^{(l)} \in \mathbb{R}^{c \times d_l}$.

At layer $l$, there are $d_{l+1}$ graph filters so that the output dimension is also $d_{l+1}$, and each node attribute in a graph filter, represented by a row of $M_j^{(l)}$, has the same dimension $d_l$ as the node feature maps in the input graph.

KerGNN Layer.

Now we consider a single KerGNN layer. We assume the input is a graph-structured dataset with undirected graphs $G = (V, E)$, where each node $v$ has an attribute $x_v \in \mathbb{R}^{d_0}$. The input node feature map is then $\phi^{(0)}(v) = x_v$.

Each node $v$ in the graph is equipped with a subgraph $G_v$, and feature maps are transformed from $\phi^{(0)}$ to $\phi^{(1)}$ in a way such that the neighbors' local information (topological information and node representations) contained in $G_v$ is aggregated into the central node $v$. We then rely on the graph filters to obtain $\phi^{(1)}$. Specifically, we calculate the $j$-th dimension of $\phi^{(1)}(v)$ by projecting the subgraph feature map onto the $j$-th graph filter, using the kernel function value between graph filter $W_j^{(0)}$ and subgraph $G_v$, i.e.,

$$\phi^{(1)}(v)_j = K\big(W_j^{(0)}, G_v\big), \qquad (5)$$

where we adopt as $K$ the random walk kernel introduced in Equation 2 (or its attributed version in Equation 3). After calculating the kernel value of the subgraph $G_v$ with respect to every graph filter, we obtain every dimension of node $v$'s feature map $\phi^{(1)}(v)$, which forms the output of the KerGNN layer.

It should be noted that using graphs $W_j$ and $G_v$ to calculate the kernel value is equivalent to performing an inner product of $\varphi(W_j)$ and $\varphi(G_v)$ in an implicit high-dimensional space, and using the feature map of $G_v$ instead of the multiset of neighboring nodes (as used in MPNNs) improves expressivity, which is analyzed in the next subsection. Besides, if we use the output space to approximate the high-dimensional space introduced by the kernel method, the updating rule corresponds to the convolutional kernel network proposed by mairal2014convolutional, and we follow the same idea when comparing KerGNNs with CNNs in a later subsection.

Multiple-layer Model.

Based on the single-layer analysis above, we can construct a multiple-layer KerGNN by stacking KerGNN layers followed by readout layers. Specifically, the input to layer $l$ is the graph $G$ with node feature maps $\phi^{(l)}$. Layer $l$ is parameterized by $d_{l+1}$ graph filters $W_1^{(l)}, \dots, W_{d_{l+1}}^{(l)}$. Each graph filter $W_j^{(l)}$ has a trainable adjacency matrix $A_j^{(l)}$ and node attribute matrix $M_j^{(l)}$. Then the $j$-th dimension of the output feature map for node $v$ can be explicitly calculated as

$$\phi^{(l+1)}(v)_j = K\big(W_j^{(l)}, G_v\big). \qquad (6)$$

The forward pass of the $l$-th layer of KerGNNs is summarized in Algorithm 1.

For graph classification, we then deploy a graph-level readout layer to generate the embedding for the entire graph. We obtain the graph representation at each layer by summing all the nodes' representations. To leverage information from every layer of the model, we concatenate the graph representations across all layers:

$$\phi(G) = \mathrm{CONCAT}\Big( \sum_{v \in V} \phi^{(l)}(v) \,\Big|\, l = 1, \dots, L \Big). \qquad (7)$$
  Input: Graph $G = (V, E)$; input node feature maps $\phi^{(l)}$; graph filters $W_1^{(l)}, \dots, W_{d_{l+1}}^{(l)}$; graph kernel function $K$
  Output: Graph $G$; output node feature maps $\phi^{(l+1)}$
  for $v \in V$ do
     for $j = 1$ to $d_{l+1}$ do
        $\phi^{(l+1)}(v)_j \leftarrow K\big(W_j^{(l)}, G_v\big)$
     end for
  end for
Algorithm 1 Forward pass in the $l$-th KerGNN layer
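Algorithm 1 can be sketched end-to-end in plain Python. This is our own toy rendering (dense adjacency matrices, the naive product-graph random walk kernel, fixed rather than trainable filters); a real implementation would batch the kernel evaluations on a GPU and learn the filter parameters by backpropagation:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rw_kernel_attr(A1, X1, A2, X2, P, lam):
    """P-step random walk kernel via the (dense) direct product graph."""
    n1, n2 = len(A1), len(A2)
    s = [dot(x1, x2) for x1 in X1 for x2 in X2]   # flattened similarities
    Ax = [[A1[i][ip] * A2[j][jp] for ip in range(n1) for jp in range(n2)]
          for i in range(n1) for j in range(n2)]   # Kronecker product
    total, v = 0.0, s[:]
    for p in range(P + 1):
        total += lam[p] * dot(s, v)
        v = [dot(row, v) for row in Ax]
    return total

def ego_subgraph(A, X, v):
    """Vertex-induced subgraph on v and its 1-hop neighbors."""
    nodes = [v] + [u for u in range(len(A)) if A[v][u]]
    return ([[A[i][j] for j in nodes] for i in nodes],
            [X[i] for i in nodes])

def kergnn_layer(A, X, filters, P, lam):
    """One KerGNN layer: output dimension j of node v is the kernel
    value between v's ego subgraph and the j-th graph filter."""
    out = []
    for v in range(len(A)):
        sub_A, sub_X = ego_subgraph(A, X, v)
        out.append([rw_kernel_attr(fA, fX, sub_A, sub_X, P, lam)
                    for fA, fX in filters])
    return out
```

Stacking calls to `kergnn_layer` (with per-layer filter sets) and summing each layer's outputs over nodes gives the multi-layer readout described above.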

Expressivity of Subgraph-based Aggregation

In this subsection, we first define subgraph-based neighborhood aggregation and discuss the requirements on the subgraph feature map for achieving higher expressivity than the 1-WL algorithm. We then show that KerGNN is one of the models satisfying these requirements.

To leverage the structural information contained in the subgraph, we aggregate the subgraph information by finding a proper subgraph feature map $\varphi(G_v)$, and update the representation of node $v$ by combining the subgraph feature map with $v$'s own feature map. Formally, we define this aggregation process as follows.

Definition 4 (Subgraph-based aggregation). A graph neural network deploying subgraph-based neighborhood aggregation at layer $l$ updates the feature mapping according to $\phi^{(l+1)}(v) = g\big(\phi^{(l)}(v), f(\varphi(G_v))\big)$, where $g$ and $f$ are update and aggregation functions, respectively.

GNNs distinguish different graphs by mapping them to different embeddings, which resembles the graph isomorphism test. xu2018powerful characterized the representational capacity of MPNNs using the WL graph isomorphism test criterion, and showed that MPNNs can be as powerful as the 1-WL graph isomorphism test if the node update, aggregation, and graph-level readout functions are injective. We follow a similar approach and show in the following that subgraph-based GNNs such as KerGNNs can be at least as powerful as the 1-WL graph isomorphism test.

Because we are comparing the model's expressivity with the 1-WL algorithm, which updates node labels based on the multiset of neighboring nodes, it is natural to require that, to achieve high expressivity, $\varphi(G_v)$ should have a one-to-one relationship with the multiset of nodes that the subgraph $G_v$ contains. We show in Lemma 1 that the graph feature map induced by the random walk kernel satisfies this condition.

Lemma 1. If $\varphi(G)$ is the feature map of graph $G$ induced by the random walk graph kernel, then $\varphi(G)$ is injective with respect to the multiset of all the nodes $G$ contains, $\{\!\{ l(v) : v \in V \}\!\}$, where $\{\!\{\cdot\}\!\}$ denotes a multiset and $l(v)$ is the label or attribute of node $v$.

The proof follows directly from the random walk kernel definition in gartner2003graph; we note that the graph feature map induced by the WL graph kernel also satisfies this lemma. Based on this injective relationship between the multiset and the subgraph feature map, we can compare the expressivity of subgraph-based GNNs and the 1-WL graph isomorphism test using the following theorem.

Theorem 1. Let $\mathcal{A}$ be a GNN with a sufficient number of GNN layers. If the following conditions hold at every layer $l$:

a) $\mathcal{A}$ aggregates and updates node features iteratively with $\phi^{(l+1)}(v) = g\big(\phi^{(l)}(v), f(\varphi(G_v))\big)$, where the functions $f$ and $g$ are injective and $\varphi$ is the feature mapping induced by the random walk kernel;

b) $\mathcal{A}$'s graph-level readout, which operates on the multiset of node features $\{\!\{ \phi^{(l)}(v) \}\!\}$, is injective;

then $\mathcal{A}$ maps any graphs $G_1$ and $G_2$ that the 1-WL test decides as non-isomorphic to different embeddings, and there exist graphs $G_1$ and $G_2$ that the 1-WL test decides as isomorphic but that $\mathcal{A}$ maps to different embeddings.

The proof is shown in the Appendix. This theorem shows that subgraph-based GNNs can be more expressive than the 1-WL isomorphism test and thus than MPNNs. In the KerGNN model, we do not explicitly calculate the subgraph feature map $\varphi(G_v)$, which lives in a high-dimensional space. Instead, we apply the kernel trick and access $\varphi(G_v)$ only through inner products of the form $K(W_j, G_v) = \langle \varphi(W_j), \varphi(G_v) \rangle$. The graph kernel function can then be seen as a composition of the functions $g$ and $f$. Therefore, according to Theorem 1, to achieve high representational power, the graph kernel function needs to be injective with respect to the subgraph feature map $\varphi(G_v)$, and we introduce the following lemma to show that the KerGNN model satisfies this requirement.

Lemma 2. There exists a set of graph filters $\{W_j\}$ such that the vector of kernel values $\big(K(W_j, G_v)\big)_j$ is unique for different subgraph feature maps $\varphi(G_v)$.

The proof is shown in the Appendix. Besides, as stated in the definition of graph filters, in the KerGNN model we parameterize the node attribute matrix and adjacency matrix of each graph filter $W_j$ instead of directly parameterizing $\varphi(W_j)$.

Connections to CNNs

Standard CNN models update the representation of each pixel by convolving filters with the patch centered at it; in GNNs, a natural analog of the patch is the subgraph. While many MPNNs draw connections with CNNs by extending 2-D convolution to graph convolution, we show in this subsection that both the 2-D image convolution and the KerGNN aggregation process can be viewed as applying kernel tricks to the input image or graph. The KerGNN model therefore naturally extends the CNN architecture to the graph domain, from a new kernel perspective.

We first show in Appendix that under suitable assumptions, the 2-D image convolution can be viewed as applying kernel functions between input patches and filters. The basic idea is that we can rethink the 2-D convolution as projecting the input image patch into the kernel-induced Hilbert space. The projection is done by performing inner product between the patch and basis vectors, which can be calculated using the kernel trick, and the projected representation in the output space will be the output of the CNN layer.

Then we can extend the same philosophy to the graph domain, by introducing subgraphs and topology-aware graph filters as the counterpart of patches and filters in CNNs, and KerGNN will adopt the kernel trick to project the input subgraph representation into the output space (detailed in Appendix). Based on these two observations, we can see that KerGNNs generalize CNNs into the graph domain by replacing the kernel function for vectors with the graph kernel function, which provides a new insight into designing GNN architecture, different from the spatial and spectral convolution perspectives.

Connections to Existing GNNs

As the subgraph of one node can be a more fruitful source of information than just the multiset of its neighbors, we show in this subsection that KerGNNs can generalize standard MPNNs. From the point of view of KerGNNs, MPNNs deploy a simple graph filter with one node, and an appropriate kernel function can be chosen within the KerGNN framework such that KerGNNs iteratively update node representations using neighborhood multiset aggregation as in MPNNs. For example, we show in the Appendix that the node update rule of the Graph Convolutional Network (GCN) (kipf2016semi) can be treated as using one-node graph filters with a properly-defined $R$-convolution graph kernel. Our model generalizes most MPNN structures by deploying more complex graph filters with multiple nodes and learnable adjacency matrices, and by using more expressive and efficient graph kernels.

Time Complexity Analysis

Most MPNNs incur a time complexity of $O(n^2)$ for a graph with $n$ nodes, or $O(m)$ if the adjacency matrix is sparse with $m$ non-zero entries, because updating the embedding of node $v$ involves its $\deg(v)$ neighbors, where $\deg(v)$ is the degree of node $v$. In KerGNNs, we apply the graph kernel to the subgraph instead of the whole graph, so the computational complexity is determined by the complexity of each subgraph. For a subgraph with $n_v$ nodes and an adjacency matrix with $m_v$ non-zero entries, we update the representation of node $v$ by calculating the random walk kernel with Equation 11 in the Appendix. This calculation takes time governed by the maximum length $P$ of the random walk, the node dimensions $d_l$ and $d_{l+1}$ of the current and next layers, and the number $c$ of nodes in each graph filter. In an undirected subgraph, $m_v$ counts the edges and lies between $n_v - 1$ and $n_v^2$. If we sum up the computation time over all nodes in the entire graph and treat $P$, $c$, $d_l$, and $d_{l+1}$ as constants, the time complexity of KerGNNs ranges between $O(m)$ and, in the worst-case scenario of a fully-connected graph, $O(n^3)$. We experimentally compare the running time of the proposed model with several GNN benchmarks. As shown in Table 3 in the Appendix, KerGNNs achieve better or similar running time compared to the fastest benchmark method, and much lower running time than higher-order GNNs.


Experiments

We evaluate the proposed model on the graph classification task and the node classification task (discussed in the Appendix), and we also demonstrate the model's interpretability by visualizing the graph filters in the trained models as well as the output graphs.

Dataset       | DD       | NCI1     | PROTEINS | ENZYMES  | IMDB-B   | IMDB-M   | REDDIT-B | COLLAB
# graphs      | 1178     | 4110     | 1113     | 600      | 1000     | 1500     | 2000     | 5000
# classes     | 2        | 2        | 2        | 6        | 2        | 3        | 2        | 3
avg. # nodes  | 284      | 30       | 39       | 33       | 20       | 13       | 430      | 74
SP            | 78.7±3.8 | 66.3±2.6 | 71.9±6.1 | 25.0±5.6 | 57.5±5.4 | 40.5±2.8 | 75.5±2.1 | 58.4±1.3
PK            | 78.0±3.8 | 72.3±2.8 | 59.7±0.3 | 61.0±6.7 | 73.9±4.3 | 51.1±5.8 | 68.5±2.9 | 77.3±2.4
WL-sub        | 77.5±3.5 | 79.5±3.3 | 74.8±3.2 | 51.2±5.3 | 72.5±4.6 | 51.5±5.8 | 67.2±4.2 | 77.5±2.4
GNTK          | OOR      | 83.5±1.2 | 75.5±2.2 | 48.2±2.4 | 75.9±3.1 | 52.2±4.2 | OOR      | OOR
DGCNN         | 76.6±4.3 | 76.4±1.7 | 72.9±3.5 | 38.9±5.7 | 69.2±3.0 | 45.6±3.4 | 87.8±2.5 | 71.2±1.9
DiffPool      | 75.0±3.5 | 76.9±1.9 | 73.7±3.5 | 59.5±5.6 | 68.4±3.3 | 45.6±3.4 | 89.1±1.6 | 68.9±2.0
ECC           | 72.6±4.1 | 76.2±1.4 | 72.3±3.4 | 29.5±8.2 | 67.7±2.8 | 43.5±3.1 | OOR      | OOR
GIN           | 75.3±2.9 | 80.0±1.4 | 73.3±4.0 | 59.6±4.5 | 71.2±3.9 | 48.5±3.3 | 89.9±1.9 | 75.6±2.3
GraphSAGE     | 72.9±2.0 | 76.0±1.8 | 73.0±4.5 | 58.2±6.0 | 68.8±4.5 | 47.6±3.5 | 84.3±1.9 | 73.9±1.7
RWGNN         | 77.6±4.7 | 73.9±1.3 | 74.7±3.3 | 57.6±6.3 | 70.8±4.8 | 48.8±2.9 | 90.4±1.9 | 71.9±2.5
GCKN          | 77.3±4.0 | 79.2±1.2 | 76.1±2.8 | 59.3±5.6 | 74.5±1.2 | 51.0±3.9 | OOR      | 74.3±2.8
1-2-3 GNN     | OOR      | 72.7±2.9 | 74.5±5.6 | OOR      | 70.7±3.4 | 50.2±2.2 | 91.1±2.1 | OOR
Powerful GNN  | OOR      | 83.4±1.8 | 75.9±3.3 | 54.8±5.5 | 73.0±4.9 | 50.5±3.2 | OOR      | 75.4±1.4
KerGNN-1      | 77.6±3.7 | 74.3±2.2 | 75.8±3.5 | 62.1±5.5 | 74.4±4.3 | 51.6±3.1 | 81.5±1.9 | 70.5±1.6
KerGNN-2      | 78.9±3.5 | 76.3±2.6 | 75.5±4.6 | 55.0±5.0 | 73.7±4.0 | 50.9±5.1 | 82.0±2.5 | 72.7±2.1
KerGNN-3      | 75.5±3.1 | 80.5±1.9 | 76.5±3.9 | 54.1±4.3 | 72.1±4.6 | 50.1±4.5 | 82.0±1.9 | 71.1±2.0
KerGNN-2-DRW  | 77.0±4.4 | 82.8±1.8 | 76.1±4.1 | 59.5±4.5 | 71.1±4.1 | 50.5±3.1 | 89.5±1.6 | 75.1±2.3

Table 1: Test set classification accuracies (%). The mean accuracy and standard deviation are reported. OOR means out of resources, either time or GPU memory.

Experiment Settings


We evaluate our proposed KerGNN model on 8 publicly available graph classification datasets. Specifically, we use DD (dobson2003distinguishing), PROTEINS (borgwardt2005protein), NCI1 (schomburg2004brenda), ENZYMES (schomburg2004brenda) for binary and multi-class classification of biological and chemical compounds, and we also use the social datasets IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, and COLLAB (yanardag2015deep).


To make a fair comparison with state-of-the-art GNNs, we follow the cross-validation procedure described in errica2019fair. We use 10-fold cross-validation for model assessment and an inner holdout technique with a 90%/10% training/validation split for model selection, following the same dataset index splits as errica2019fair. Besides, we use the Adam optimizer with an initial learning rate of 0.01 and halve the learning rate every 50 epochs. For the four social datasets, we use node degrees as the input attributes for each node, and for the four bio/chemical datasets, we use node labels or attributes as the input features.


The hyper-parameters we tune for each dataset include the learning rate, the dropout rate, the numbers of KerGNN and MLP layers, the number of graph filters at each layer, the number of nodes in each graph filter, the number of nodes in each subgraph, and the hidden dimension of each KerGNN layer. For the random walk kernel, we also tune the maximum length of the random walks.

Baseline Models.

We consider KerGNN models with single and multiple KerGNN layers, namely KerGNN-$k$, corresponding to a KerGNN model with $k$ layers, and KerGNN-$k$-DRW, representing the model deploying the deep random walk kernel. We also compare our models with widely-used GNNs: DGCNN (zhang2018end), DiffPool (ying2018hierarchical), ECC (simonovsky2017dynamic), GIN (xu2018powerful), GraphSAGE (hamilton2017inductive), RWGNN (rwgnn), GCKN (chen2020convolutional), and two higher-order GNNs: 1-2-3 GNN (morris2019weisfeiler) and Powerful GNN (maron2019provably). Part of the results for these baseline GNNs are taken from errica2019fair, and we run GCKN, 1-2-3 GNN, and Powerful GNN using the official implementations. In addition, we compare the proposed KerGNN model with three popular GNN-unrelated graph kernels: the shortest path (SP) kernel (borgwardt2005shortest), the propagation (PK) kernel (neumann2016propagation), and the Weisfeiler-Lehman subtree (WL-sub) kernel (shervashidze2011weisfeiler), as well as the GNN-related GNTK (du2019graph). We use the GraKeL library (JMLR:v21:18-370) to implement these graph kernels and run GNTK using the official implementation.


Results.

The graph classification results are shown in Table 1. We can see that the proposed models achieve superior performance compared with conventional GNNs restricted by the 1-WL limit, and achieve performance similar to higher-order GNNs with less running time. The single-layer KerGNN model performs well on small graphs such as the IMDB social datasets. For larger graphs, deeper models with more layers or with the deep random walk kernel perform better. We show more experimental results, model parameter studies, and node classification results in the Appendix. The optimal parameters of the graph filters differ across datasets, depending on the local structures of the different types of graphs, e.g., the star patterns in REDDIT-B graphs and the ring and chain patterns in NCI1 graphs.

Model Interpretability

Figure 2: Model visualization. Input graphs are drawn from (a) MUTAG and (b) REDDIT-B datasets with different node shapes corresponding to different atom types. In both graph filters and output graphs, node color represents relative attribute value.

Visualizing the filters in CNNs gives insights into which features CNNs focus on. Following the same idea, we can visualize the trained graph filters, which indicate some key structures of the input dataset. We visualize the graph filters trained on the MUTAG (KKMMN2016) and REDDIT-B datasets in Figure 2. (We use the MUTAG dataset for visualization due to its easily interpretable structure; however, we do not use this dataset in the cross-validation because its number of graphs is too small.) The MUTAG dataset consists of 188 chemical compounds divided into two classes according to their mutagenic effect on a bacterium. As shown in the input graphs in Figure 2(a), most of the MUTAG graphs consist of ring structures with 6 carbon atoms.

KerGNNs and other standard GNNs can generate output graphs with updated node attributes, and we can identify the nodes important for the classification task using relative attribute values, as shown in the output graphs in Figure 2. We can make several observations from the output MUTAG chemical structures: 1) the carbon atoms at the connection points of rings are more important than those connected to atom groups, which in turn are more important than those at the remaining positions; 2) the atoms in an atom group are always less important than the carbon atoms in the carbon ring.

Compared to standard GNN variants, KerGNNs have graph filters as extra information to help explain the predictions of the model. To visualize the graph filters, we extract the adjacency matrix and the attribute matrix of each graph filter from the trained KerGNN layer. We then apply a ReLU function to prune the unimportant edges. In Figure 2, we use different node sizes to denote the relative importance of nodes. For the MUTAG dataset, we can see that most of the graph filters have ring structures, similar to the carbon rings in the input graphs, and some graph filters have small connected rings, similar to concatenated carbon rings. It should be noted that the number of nodes in the rings of the graph filters may not equal 6 because we limit the total number of nodes in each filter to 8. KerGNN layers use these rings in the graph filters to match the local structural patterns (e.g., carbon rings) in the input graphs via the graph kernels. This indicates the importance of carbon rings for the mutagenic effect, which also corresponds to our observations in the output graphs.


Conclusion

In this paper, we have proposed Kernel Graph Neural Networks (KerGNNs), a new graph neural network framework that is not restricted by the theoretical limits of message passing aggregation. KerGNNs are inspired by several characteristics of CNNs and can be seen as a natural extension of CNNs to the graph domain, from the viewpoint of kernel methods. KerGNNs achieve competitive performance on a variety of datasets compared with several GNNs and graph kernels, and can provide improved explainability and transparency by visualizing the graph filters and output graphs.


Acknowledgements

This work was partially supported by the U.S. Office of Naval Research under Grant N00173-21-1-G006, the U.S. National Science Foundation AI Institute Athena under Grant CNS-2112562, and the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Office of Naval Research, the U.S. National Science Foundation, the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.


Appendix A Calculation of Random Walk Kernel

To calculate the random walk kernel between two graphs $G_1 = (A_1, X_1)$ and $G_2 = (A_2, X_2)$, we have to calculate Equation 3 in the main paper:

$$k(G_1, G_2) = \sum_{l=0}^{P} \lambda_l \, s^\top A_\times^l \, s, \qquad (8)$$

where $A_\times = A_1 \otimes A_2$, $\otimes$ denotes the Kronecker product between two matrices, $s = \mathrm{vec}(S)$, and $\mathrm{vec}(\cdot)$ denotes flattening a matrix into a vector by stacking all the columns. Besides, $S = X_2 X_1^\top$ collects the inner products between the node attributes of the two graphs. In the following, we assume $G_1$ and $G_2$ are undirected graphs, which is the case for all the datasets we use in the experiments. According to the properties of the Kronecker product,

$$(A \otimes B)(C \otimes D) = (AC) \otimes (BD), \qquad (9)$$

$$(C^\top \otimes A)\,\mathrm{vec}(B) = \mathrm{vec}(ABC), \qquad (10)$$

we can calculate Equation 8 as

$$k(G_1, G_2) = \sum_{l=0}^{P} \lambda_l \, \mathbf{1}^\top \big( S \circ (A_2^l \, S \, A_1^l) \big) \mathbf{1}, \qquad (11)$$

where $\circ$ means the Hadamard (element-wise) product. In the KerGNN model, $G_1$ and $G_2$ represent the graph filter and the input graph, respectively, and we can use Equation 11 to avoid explicitly constructing the direct product graph.
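To make the identity concrete, the following numpy sketch (with randomly generated attributed graphs and an assumed geometric weight sequence $\lambda_l = 0.5^l$; sizes are arbitrary) checks that the Hadamard-product form of Equation 11 matches the direct computation on the Kronecker (direct) product graph:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_adj(n):
    """Random symmetric 0/1 adjacency matrix of an undirected graph."""
    A = np.triu((rng.random((n, n)) < 0.5).astype(float), 1)
    return A + A.T

n1, n2, d, P = 4, 5, 3, 3
lam = [0.5 ** l for l in range(P + 1)]   # walk weights (assumed geometric)

A1, A2 = random_adj(n1), random_adj(n2)
X1, X2 = rng.random((n1, d)), rng.random((n2, d))

# --- direct computation on the Kronecker product graph -----------------
S = X2 @ X1.T                    # node-similarity matrix, shape (n2, n1)
s = S.flatten(order="F")         # vec(S): stack columns
Ax = np.kron(A1, A2)             # adjacency of the direct product graph
k_direct = sum(l_w * s @ np.linalg.matrix_power(Ax, l) @ s
               for l, l_w in enumerate(lam))

# --- Hadamard-product formulation (Equation 11), no Kronecker needed ---
k_fast, M = 0.0, S.copy()
for l_w in lam:
    k_fast += l_w * np.sum(S * M)   # 1^T (S ∘ A2^l S A1^l) 1
    M = A2 @ M @ A1                 # advance the walk by one step

assert np.isclose(k_direct, k_fast)
```

The fast form never materializes the $n_1 n_2 \times n_1 n_2$ matrix $A_\times$, which is what makes the kernel practical inside a GNN layer.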

Appendix B Proofs

Proof of Theorem 1

Let $\mathcal{A}$ be a graph neural network which satisfies conditions a) and b). We first prove the first part of the conclusion: $\mathcal{A}$ maps any graphs $G_1$ and $G_2$ that the 1-WL test decides as non-isomorphic to different embeddings. Suppose that starting from iteration $K$, the 1-WL test decides $G_1$ and $G_2$ are non-isomorphic (before that, the 1-WL test cannot distinguish the two graphs), but the graph neural network $\mathcal{A}$ maps them to the same embedding $h_{G_1} = h_{G_2}$. This indicates that $G_1$ and $G_2$ always have the same multisets of node features at iteration $K$ and at any iteration $k \le K$. Next we hope to reach a contradiction to this statement. To find this contradiction, we first show that on graph $G_1$ or $G_2$, if node features in the graph neural network satisfy $h_v^{(k)} = h_u^{(k)}$, we always have equal 1-WL node labels $l_v^{(k)} = l_u^{(k)}$ for any iteration $k$. This apparently holds for $k = 0$ because 1-WL and the graph neural network start with the same node features. Suppose this holds for iteration $k-1$; if for any two nodes $u$ and $v$, $h_v^{(k)} = h_u^{(k)}$, then according to the node update rule in Theorem 1, we can get

$$\phi\Big(h_v^{(k-1)}, f\big(\{h_w^{(k-1)} : w \in \mathcal{N}(v)\}\big)\Big) = \phi\Big(h_u^{(k-1)}, f\big(\{h_w^{(k-1)} : w \in \mathcal{N}(u)\}\big)\Big).$$

Because $\phi$ and $f$ are both injective, we then obtain

$$h_v^{(k-1)} = h_u^{(k-1)}, \qquad \{h_w^{(k-1)} : w \in \mathcal{N}(v)\} = \{h_w^{(k-1)} : w \in \mathcal{N}(u)\}.$$

According to Lemma 1, if two multisets of node features are equal, then the corresponding multisets of 1-WL node labels are equal, and because $\{h_w^{(k-1)} : w \in \mathcal{N}(v)\} = \{h_w^{(k-1)} : w \in \mathcal{N}(u)\}$, we can get

$$\{l_w^{(k-1)} : w \in \mathcal{N}(v)\} = \{l_w^{(k-1)} : w \in \mathcal{N}(u)\}.$$

By our assumption at iteration $k-1$, we must have

$$l_v^{(k-1)} = l_u^{(k-1)}.$$

Because the mapping in the 1-WL test is injective with respect to the node label and the multiset of neighborhood labels, we get $l_v^{(k)} = l_u^{(k)}$. By induction, if node features in the graph neural network satisfy $h_v^{(k)} = h_u^{(k)}$, we always have 1-WL node labels $l_v^{(k)} = l_u^{(k)}$ for any iteration $k$. This creates a valid mapping $g$ such that $l_v^{(k)} = g(h_v^{(k)})$ for any node $v$ in the graph.

Because the 1-WL test decides graphs $G_1$ and $G_2$ as non-isomorphic, their multisets of node labels differ at iteration $K$, i.e., $\{l_v^{(K)} : v \in G_1\} \neq \{l_u^{(K)} : u \in G_2\}$. With the mapping between node features and 1-WL node labels, we can get $\{h_v^{(K)} : v \in G_1\} \neq \{h_u^{(K)} : u \in G_2\}$. Because the graph-readout function of the graph neural network is injective according to Theorem 1, we get $h_{G_1} \neq h_{G_2}$, which contradicts our assumption.

For the second part of the conclusion, that there exist graphs $G_1$ and $G_2$ that are decided as isomorphic by the 1-WL test but as non-isomorphic by a subgraph-based graph neural network, it suffices to find one example. The pair shown in Figure 1(a) cannot be distinguished by the 1-WL graph isomorphism test, but the random walk graph kernel can map the subgraphs associated with corresponding nodes to different embeddings by interacting with an appropriate graph filter (e.g., a graph filter with one node).

Proof of Lemma 2

To find at least one feasible $w$, assume the length of the non-zero vector $x$ is a large but finite $n$, and that its (non-negative integer) entries have maximum value $B - 1$. Then we can encode each value of $x$ with the base $B$ according to its position in the vector. Specifically, we can let $w = (B^0, B^1, \ldots, B^{n-1})^\top$, so that $\langle w, x \rangle$ reads off the base-$B$ representation of $x$, and the inner product will be injective with respect to $x$.
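As a sanity check of this encoding argument, the short Python sketch below (with hypothetical base $B = 5$ and length $n = 4$, assuming non-negative integer entries) verifies that the inner product with $w = (B^0, \ldots, B^{n-1})$ assigns a distinct value to every vector:

```python
import numpy as np
from itertools import product

# Entries of x are integers in {0, ..., B-1}; w holds powers of the base B.
B, n = 5, 4
w = B ** np.arange(n)   # w = (B^0, B^1, ..., B^{n-1})

# Every possible vector gets a distinct inner product (a distinct base-B number).
codes = {int(np.dot(w, x)) for x in product(range(B), repeat=n)}
assert len(codes) == B ** n
```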

Appendix C Connections between KerGNNs and CNNs

In this section, we discuss the claim in the paper that KerGNNs generalize CNNs to the graph domain from the kernel's point of view. We show in the first subsection that the 2-D image convolution in CNNs is equivalent to computing an appropriate kernel function between patches and filters. We then show in the second subsection that KerGNNs generalize this aggregation approach by introducing the counterparts of the patch, the filter, and the convolution in the graph regime.

Rethinking CNNs from the Kernel’s Perspective

Standard CNN models use image convolution to aggregate the local information around each pixel. In this subsection, we show that the image convolution can be viewed as applying kernel functions between input patches and filters in the convolutional layers.

Given a convolutional layer in a CNN, the input to the layer is an image $I$, where $\Omega$ is the set of pixel coordinates of $I$. Typically, $\Omega$ is a two-dimensional grid. We also define a feature mapping function to map every pixel in the image to a finite vector space, $f : \Omega \to \mathbb{R}^{d}$, where $d$ is the dimension of the input feature space. For each pixel with coordinate $z \in \Omega$, we can find a neighborhood patch $P_z$ centered at the pixel $z$, with patch size $e \times e$. $f(z)$ is the feature map of the pixel $z$ in $I$, and with some abuse of notation, $f(P_z)$ is defined as the concatenation of the feature maps of every pixel in the patch $P_z$, i.e., $f(z')$ for every pixel $z' \in P_z$. Thus, $f(P_z)$ lives in the vector space $\mathbb{R}^{d e^2}$. For example, given an RGB image as the input, $f(z)$ is in the Euclidean space $\mathbb{R}^{3}$, and $f(P_z)$ is the feature vector in $\mathbb{R}^{3 e^2}$.

Next, we represent the output of the layer as a different image $I'$ with the feature mapping function $f' : \Omega' \to \mathbb{R}^{d'}$, where $d'$ is the dimension of the output feature space. As shown in Figure 3(a), $I$ and $I'$ may not have the same size, but every pixel $z' \in \Omega'$ corresponds to a pixel $z \in \Omega$ with an associated patch $P_z$. The goal of the convolutional layer is then to learn this output feature mapping function $f'$. Specifically, the convolutional layer adopts $d'$ filters to perform the 2-D convolution operation over the image. The $i$-th filter performs the dot product over each patch of the image with a fixed stride, and can be parameterized as a vector $w_i \in \mathbb{R}^{d e^2}$. The output feature $f'_i(z')$ is obtained by computing the dot product of $f(P_z)$ and $w_i$, followed by an element-wise nonlinear function $\sigma$. In other words, it is the $i$-th dimension of the feature representation of the pixel $z'$ in the new image $I'$. The process can be written as follows:

$$f'_i(z') = \sigma\big(\langle f(P_z), w_i \rangle\big) = \sigma\big((F * W_i)(z)\big), \qquad (16)$$

where $*$ represents the image convolution, and $F$ and $W_i$ are tensors with shape $e \times e \times d$ reshaped from the vectors $f(P_z)$ and $w_i$, respectively. We omit the bias term here for simplicity.

Now, we rethink this process from the kernel perspective. According to the theory of kernel methods, each positive definite kernel function implicitly defines an RKHS $\mathcal{H}$. Next, we try to make this implicit RKHS the output feature space of the convolutional layer by appropriately designing the associated kernel function. Here, we can define a simple dot-product RBF kernel function between the input feature map vector $f(P_z)$ and the filter vector $w_i$ as

$$\mathcal{K}(f(P_z), w_i) = \exp\big(\alpha \langle f(P_z), w_i \rangle\big). \qquad (17)$$

It is worth noting that the RKHS determined by this RBF kernel is of infinite dimension, and thus we need to make the assumption here that $\mathcal{H}$ can be approximated by a finite output vector space, with a finite set of basis vectors $\phi(w_1), \ldots, \phi(w_{d'})$ defined by the trainable filters, where $\phi$ is the feature map associated with the kernel. This is similar to the assumption made by (mairal2014convolutional). It is noted that the output feature space can also be approximated by a subspace of $\mathcal{H}$ using the Nyström method, as in (mairal2016end).

Then we can derive the output feature map of the pixel $z'$ by projecting the input feature map vector into $\mathcal{H}$, and thus the $i$-th dimension of the output feature map can be computed as the projection of $\phi(f(P_z))$ onto the $i$-th basis vector as

$$f'_i(z') = \langle \phi(f(P_z)), \phi(w_i) \rangle_{\mathcal{H}} = \mathcal{K}(f(P_z), w_i) = \exp\big(\alpha \langle f(P_z), w_i \rangle\big), \qquad (18)$$

where the second equality holds because of the kernel trick. We thus obtain a formula for the output feature mapping function $f'$ from a perspective that is different from image convolution, and the result is similar to Equation 16. The only difference is in the element-wise nonlinear activation function, and it is worth noticing that the exponential nonlinearity induced by the RBF kernel is very similar to the popular ReLU function used in practice.

Now we come to the conclusion that, under suitable assumptions, the standard image convolution in CNN layers can be approximately interpreted as applying kernel functions between the input patches and the trainable filters, and we can use the computed kernel values to update the feature map of the corresponding pixel in the output space. This is the major motivation for the design of KerGNNs: in a KerGNN, we use the computed kernel values to update the feature map of the node in the output graph. This comparison between convolutional layers and KerGNN layers is shown in Figure 3.
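The equivalence between image convolution and per-patch dot products can be sketched in a few lines of numpy; the image size, channel count, and unit stride below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, d, e = 6, 6, 3, 3          # image size, channels, patch/filter size
image = rng.random((H, W, d))
filt = rng.random((e, e, d))     # one trainable filter
w_vec = filt.reshape(-1)         # the same filter flattened into a vector

out_conv = np.zeros((H - e + 1, W - e + 1))
out_kernel = np.zeros_like(out_conv)
for i in range(H - e + 1):
    for j in range(W - e + 1):
        patch = image[i:i + e, j:j + e, :]
        out_conv[i, j] = np.sum(patch * filt)                # "convolution" (cross-correlation)
        out_kernel[i, j] = np.dot(patch.reshape(-1), w_vec)  # dot product of patch and filter vectors

# The two views coincide exactly; an exponential on top corresponds to the
# dot-product RBF kernel value (alpha = 1), cf. Equation 17.
out_rbf = np.exp(out_kernel)
assert np.allclose(out_conv, out_kernel)
```

Applying an element-wise nonlinearity to `out_kernel` (ReLU in practice, the exponential under the RBF view) then yields one channel of the output image.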

Figure 3: Comparison of filters in CNNs and KerGNNs. (a) The yellow grid represents the input image, and the pixels in the red box represent the patch around pixel $z$. The blue grid represents the filter. The position of the corresponding pixel in the output grid is denoted by $z'$. (b) The graph in the yellow shadow represents the subgraph of node $v$. The graph with the blue shadow represents the graph filter.

Generalizing CNNs to KerGNNs

We explain in this subsection how we derive the update rule shown in Algorithm 1, inspired by CNNs. We consider a single KerGNN layer whose input is an undirected graph $G$ together with the node feature map $f$ of the input graph.

We rely on the graph filters (as the counterpart of the CNN filters) to aggregate each node's subgraph (as the counterpart of the patch) and obtain the output feature map $f'$. Specifically, we consider using an $R$-convolution-type kernel function $\mathcal{K}$, which implicitly defines an RKHS $\mathcal{H}$. Similar to the analysis of CNNs from the kernel perspective (as discussed in the first subsection), we assume that $\mathcal{H}$ can be approximated by a finite-dimensional space spanned by a set of vectors $\phi(W_1), \ldots, \phi(W_{d'})$, where $W_1, \ldots, W_{d'}$ are the trainable graph filters. Then, we calculate $f'(v)$ by projecting the subgraph feature map $\phi(G_v)$ into this space, and the $i$-th dimension of $f'(v)$ can be calculated as the inner product between $\phi(G_v)$ and the basis vector $\phi(W_i)$, i.e.,

$$f'_i(v) = \langle \phi(G_v), \phi(W_i) \rangle_{\mathcal{H}} = \mathcal{K}(G_v, W_i), \qquad (19)$$

where we adopt the random walk kernel as $\mathcal{K}$. Similar to our conclusions for CNNs, Equation 19 indicates that we can use the kernel values computed between the subgraph of a node and the graph filters as the output feature map of that node. After calculating the kernel value of the subgraph $G_v$ with respect to every graph filter $W_i$, we obtain every dimension of the feature map of node $v$ in the output graph, which is the output of the KerGNN layer.

We compare CNNs and KerGNNs in Figure 3, and summarize the similarities in the following:

Sliding over Inputs. In a CNN, the filter is systematically applied to each filter-sized patch of the input image, from left to right and top to bottom, with a specified stride. In KerGNNs, for each node $v$ we sample a subgraph $G_v$ consisting of $v$ and its $k$-hop neighbors, and the graph filter is applied to $G_v$ for every node $v$ (the adjacency of the input graph is preserved in the subgraph). It should be noted that the operations defined here do not require the subgraph and the graph filter to have the same number of nodes or the same topology.

Shared Parameters. In a CNN, all the patches share the same filter for convolution to reduce the number of parameters. In KerGNNs, all the subgraphs $G_v$ also share the same graph filters, and thus the number of graph filter parameters does not scale up as the input graph becomes larger.

Local Aggregation. In a CNN, the patch consists of a central pixel and its neighboring pixels, and the feature map of the patch contains all the neighbors' information. The filters aggregate the patch and assign the corresponding kernel value to the output feature map of the central pixel. In KerGNNs, we use the graph filters to aggregate the feature map of the subgraph $G_v$, and let the kernel value be the output feature map of the central node $v$.
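Putting the three ingredients together, a minimal (non-trainable) forward pass of one KerGNN layer can be sketched as follows. The 1-hop subgraphs, filter sizes, and the $P$-step random walk kernel with geometric weights are illustrative assumptions, not the exact training setup:

```python
import numpy as np

rng = np.random.default_rng(2)

def rw_kernel(A1, X1, A2, X2, P=2, lam=0.5):
    """P-step random walk kernel between attributed graphs (A1,X1), (A2,X2)."""
    S = X2 @ X1.T            # node-attribute similarity matrix
    k, M = 0.0, S.copy()
    for l in range(P + 1):
        k += (lam ** l) * np.sum(S * M)
        M = A2 @ M @ A1      # advance the walk by one step
    return k

def kergnn_layer(A, X, filters):
    """Output feature i of node v = kernel between v's 1-hop subgraph
    and the i-th (hidden) graph filter."""
    n = A.shape[0]
    out = np.zeros((n, len(filters)))
    for v in range(n):
        idx = np.concatenate(([v], np.flatnonzero(A[v])))   # v plus its neighbors
        A_sub, X_sub = A[np.ix_(idx, idx)], X[idx]          # induced subgraph
        for i, (A_f, X_f) in enumerate(filters):
            out[v, i] = rw_kernel(A_sub, X_sub, A_f, X_f)
    return out

# Toy input graph with 5 nodes and 3-dim attributes; 4 graph filters of 3 nodes.
A = np.array([[0, 1, 1, 0, 0], [1, 0, 1, 0, 0], [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=float)
X = rng.random((5, 3))
filters = []
for _ in range(4):
    A_f = np.triu((rng.random((3, 3)) < 0.6).astype(float), 1)
    filters.append((A_f + A_f.T, rng.random((3, 3))))   # (adjacency, node attributes)

out = kergnn_layer(A, X, filters)   # shape: (num nodes, num graph filters)
assert out.shape == (5, 4)
```

In the actual model, the filter adjacency and attribute matrices are trainable parameters updated by backpropagation; this sketch only illustrates the sliding, sharing, and local aggregation pattern.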

Appendix D Connections between KerGNNs and MPNNs

We show in this section that KerGNNs generalize standard MPNNs. From the point of view of KerGNNs, MPNNs deploy a simple graph filter with one node, and an appropriate kernel function can be chosen within the KerGNN framework such that KerGNNs iteratively update node representations using neighborhood aggregation as in MPNNs. For example, the vertex update rule of the Graph Convolutional Network (GCN) (kipf2016semi) can be written as

$$h'_v = \sigma\Big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \tfrac{1}{\sqrt{\deg(v)\deg(u)}} \, W h_u \Big), \qquad (20)$$

where $\mathcal{N}(v)$ represents all the neighboring nodes of $v$, $\sigma$ is the element-wise nonlinear function, and $W \in \mathbb{R}^{d' \times d}$ is the trainable weight matrix. To interpret this update rule in the KerGNN framework, we can define the subgraph $G_v$ containing the nodes $\mathcal{N}(v) \cup \{v\}$. We also define $d'$ graph filters $W_1, \ldots, W_{d'}$, and each graph filter $W_i$ is defined as a graph with a single node. The attribute of that node is parameterized by $w_i$, the $i$-th row of $W$. Then we define an $R$-convolution graph kernel as

$$\mathcal{K}(G_v, W_i) = \sum_{u \in \mathcal{N}(v) \cup \{v\}} \kappa(h_u, w_i), \qquad (21)$$

and the node-wise kernel $\kappa$ is defined as

$$\kappa(h_u, w_i) = \tfrac{1}{\sqrt{\deg(v)\deg(u)}} \, \langle h_u, w_i \rangle. \qquad (22)$$
Using the KerGNN framework, the $i$-th dimension of the output feature map of node $v$ can be written as the kernel function between $G_v$ and $W_i$:

$$f'_i(v) = \mathcal{K}(G_v, W_i) = \sum_{u \in \mathcal{N}(v) \cup \{v\}} \tfrac{1}{\sqrt{\deg(v)\deg(u)}} \, \langle h_u, w_i \rangle, \qquad (23)$$

which, followed by the nonlinearity $\sigma$, is equivalent to the $i$-th dimension of Equation 20. Therefore, the message aggregation in most MPNNs can be treated as using one-node graph filters in KerGNNs, and our proposed method generalizes MPNNs by deploying more complex graph filters with multiple nodes and a learnable adjacency matrix.
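The equivalence can be checked numerically. The sketch below (on a random toy graph, with the symmetric normalization of Equation 20 and self-loops included) compares the matrix form of the GCN aggregation with the kernel view that sums node-wise kernels between the subgraph and one-node filters:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d_out = 6, 4, 3
A = np.triu((rng.random((n, n)) < 0.5).astype(float), 1)
A = A + A.T                         # undirected toy graph
X = rng.random((n, d))              # node features h_u
W = rng.random((d, d_out))          # columns w_i: the one-node graph filters

A_hat = A + np.eye(n)               # add self-loops
deg = A_hat.sum(1)

# Matrix form of the GCN aggregation (Equation 20, before the nonlinearity).
D_inv_sqrt = np.diag(deg ** -0.5)
gcn_out = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W

# Kernel view: feature i of node v is an R-convolution kernel between the
# subgraph {v} ∪ N(v) and a one-node graph filter with attribute w_i.
ker_out = np.zeros((n, d_out))
for v in range(n):
    for u in np.flatnonzero(A_hat[v]):
        c = 1.0 / np.sqrt(deg[v] * deg[u])
        ker_out[v] += c * (X[u] @ W)   # Σ_u c_vu <h_u, w_i>, all i at once

assert np.allclose(gcn_out, ker_out)
```

Replacing the one-node filters with multi-node filters and the linear node-wise kernel with a random walk kernel recovers the full KerGNN update.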

Appendix E Model Implementation Details

To construct the subgraph for a node, we first find all the $k$-hop neighbors of the node, and extract the subgraph induced by the node and all its neighbors. In the implementation, to be compatible with matrix multiplication, we set a maximum subgraph size. Any subgraph exceeding this limit is truncated so that nearer neighbors are preserved. The adjacency matrix of a subgraph that does not reach this limit is padded with zeros. Therefore, the adjacency matrices of all the subgraphs have the same size.
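A minimal sketch of this subgraph construction (breadth-first traversal so that truncation keeps nearer neighbors, followed by zero-padding; the function and variable names are ours, not from the released code):

```python
import numpy as np
from collections import deque

def khop_subgraph(A, root, k, max_nodes):
    """BFS up to k hops from `root`; truncate to `max_nodes` keeping
    nearer neighbors first, then zero-pad the adjacency matrix."""
    order, seen = [root], {root}
    frontier, hops = deque([root]), {root: 0}
    while frontier:
        u = frontier.popleft()
        if hops[u] == k:
            continue
        for v in np.flatnonzero(A[u]):
            if v not in seen:
                seen.add(v)
                hops[v] = hops[u] + 1
                order.append(v)
                frontier.append(v)
    idx = order[:max_nodes]                 # truncation keeps closer nodes
    A_sub = np.zeros((max_nodes, max_nodes))
    m = len(idx)
    A_sub[:m, :m] = A[np.ix_(idx, idx)]     # induced adjacency, zero-padded
    return A_sub, idx

# 6-node path graph: 0-1-2-3-4-5
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1

A_sub, idx = khop_subgraph(A, root=0, k=3, max_nodes=3)
assert [int(i) for i in idx] == [0, 1, 2]   # nearest nodes survive truncation

A_pad, idx2 = khop_subgraph(A, root=5, k=1, max_nodes=4)
assert len(idx2) == 2 and not A_pad[2:, :].any()   # zero-padded rows
```

The fixed output shape is what allows all subgraphs in a batch to be stacked and processed with dense matrix multiplications.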

For the first layer of the KerGNN model, we optionally add an additional linear mapping to transform node attributes of the input graph to a specified dimension (the dimension of node attributes of graph filter in the first layer), such that we can calculate the graph kernel between input graphs and graph filters at the first layer.

For simplicity, we define all the graph filters at the same layer to have the same number of nodes. Besides, for fair comparison, we only use one type of graph kernel within one model, although different graph filters could interact with the same subgraphs through different types of graph kernels, so that different graph kernels could be mixed within one model.

Appendix F Experiment Details and More Results

Graph Classification Task

Hyper-parameter Search.

We conduct the experiments on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with an NVIDIA GeForce RTX 2070 GPU. We use grid search to select the hyper-parameters for each model during cross-validation, and the hyper-parameter search ranges are shown in Table 2.

hidden dimension of the first layer [8; 16; 32; 64]
number of graph filters [16; 32; 64; 128]
number of nodes per graph filter [2; 4; 6; 8; 10; 12; 14; 16; 18; 20]
maximum number of nodes for subgraph [5; 10; 15; 20; 25; 30]
$k$-hop neighborhood [1; 2; 3]
maximum random walk length [1; 2; 3; 4; 5]
hidden dimension of linear layer [8; 16; 32; 48; 64]
dropout rate [0.2; 0.4; 0.6; 0.8]
Table 2: Hyper-parameter search range.

Details of Datasets.

We use 5 bioinformatics datasets: MUTAG is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds with 7 discrete node labels. The PROTEINS dataset uses secondary structure elements (SSEs) as nodes, and two nodes are connected if they are neighbors in the amino-acid sequence or in 3D space. It has 3 discrete node labels, representing helix, sheet, or turn. NCI1 is a dataset made publicly available by the National Cancer Institute (NCI); it is a balanced subset of chemical compounds screened for the ability to suppress or inhibit the growth of a panel of human tumor cell lines. DD is a dataset of 1178 protein X-ray structures, where two nodes in a protein are connected by an edge if they are less than 6 Angstroms apart. The prediction task is to classify the protein structures into enzymes and non-enzymes.

We use 4 social datasets: IMDB-BINARY and IMDB-MULTI are movie collaboration datasets. Each graph corresponds to an ego-network for each actor/actress, where nodes correspond to actors/actresses and an edge is drawn between two actors/actresses if they appear in the same movie. Each graph is derived from a pre-specified genre of movies, and the task is to classify the genre that the graph is derived from. IMDB-BINARY considers two genres: Action and Romance. IMDB-MULTI considers three classes: Comedy, Romance, and Sci-Fi. Each graph of REDDIT-BINARY dataset corresponds to an online discussion thread and nodes correspond to users. Two nodes are connected if at least one of them responded to another’s comment. The task is to classify each graph to a community or a subreddit it belongs to. COLLAB is a scientific collaboration dataset, derived from 3 public collaboration datasets, namely, High Energy Physics, Condensed Matter Physics, and Astro Physics. Each graph corresponds to an ego-network of different researchers from each field. The task is to classify each graph to a field the corresponding researcher belongs to.

Figure 4: Test accuracy w.r.t. model parameters. The results are obtained from experiments on 1 fold of the datasets, and in all three plots only the model parameter on the x-axis is changed, with the remaining model parameters fixed.

Model Parameters.

We study how the test accuracy is influenced by the number of nodes in the graph filters, which determines the size and complexity of the graph filters. As shown in Figure 4(a), the optimal size of the graph filter differs across datasets, depending on the local structures of different types of graphs, e.g., the star patterns in graphs of REDDIT-B and the ring and chain patterns in graphs of NCI1. We also study the influence of the maximum length of random walks, as shown in Figure 4(b): longer walks generally benefit the classification results, except for the ENZYMES dataset, where the model performs best with walks of length 2. In the code implementation, we specify the maximum number of nodes that a subgraph can contain, which implicitly controls the size of the subgraphs, so we also study the influence of this node-number threshold in Figure 4(c). DD and ENZYMES achieve higher accuracy with larger subgraphs, because a larger subgraph contains richer neighborhood topology information. The remaining datasets are not influenced much, because we fix the size of the graph filters, and the model performance degrades when the subgraph size and the graph filter size mismatch; for example, small graph filters cannot handle larger subgraphs.

Model Running Time.

We compare the running time of the proposed model with several GNN benchmarks, as shown in Table 3. We measure the one-epoch running time averaged over 50 epochs, using the same GPU card. All the models are set to have hidden dimension 32 and 1 MLP layer. For KerGNNs, the number of nodes in the graph filter is set to 6, and the maximum subgraph size is set to 10. We observe that KerGNNs run much faster than the high-order GNN benchmarks, and have running times similar to those of conventional GNN models with 1-WL constraints.

DGCNN 0.271±0.005 0.425±0.009 0.134±0.006 0.069±0.002 0.113±0.007 0.161±0.008 0.411±0.010 0.824±0.053
DiffPool 2.887±0.029 4.232±0.035 1.125±0.006 0.131±0.007 0.210±0.031 0.295±0.013 9.580±0.233 2.296±0.088
ECC 110.14±4.31 1.179±0.012 0.550±0.012 0.233±0.005 0.254±0.007 0.304±0.050 OOR OOR
GIN 0.344±0.003 0.443±0.007 0.133±0.006 0.072±0.004 0.118±0.032 0.160±0.008 0.740±0.053 0.809±0.044
GraphSAGE 0.141±0.002 0.331±0.014 0.088±0.003 0.047±0.002 0.081±0.004 0.122±0.009 0.198±0.011 0.589±0.026
1-2-3 GNN OOR 1.228±0.075 1.056±0.058 OOR 1.301±0.071 1.754±0.075 0.608±0.025* OOR
Powerful GNN OOR 19.764±0.270 4.24±0.23 1.997±0.150 2.701±0.084 3.417±0.092 OOR 17.578±0.621
KerGNN-1 0.732±0.041 0.445±0.033 0.183±0.034 0.370±0.030 0.092±0.039 0.086±0.026 0.865±0.051 1.336±0.031
KerGNN-2 1.486±0.078 0.688±0.038 0.373±0.043 0.085±0.019 0.148±0.044 0.225±0.036 1.677±0.051 2.634±0.047
KerGNN-3 2.748±0.078 0.876±0.044 0.553±0.026 0.126±0.024 0.196±0.019 0.343±0.030 2.551±0.065 3.944±0.046
KerGNN-DRW 1.782±0.051 0.753±0.034 0.405±0.045 0.090±0.025 0.155±0.036 0.240±0.036 2.078±0.060 2.648±0.039
Table 3: The measured one-epoch running time (s) of different GNN models. (* This result is obtained from the 1-GNN model.)

Node Classification Task

We further evaluate the proposed KerGNN model on the node classification task. We use 4 datasets: Cora, Citeseer, Pubmed (sen2008collective), and Chameleon (rozemberczki2021multi). For each dataset, we randomly split the nodes of each class into 60%, 20%, and 20% for training, validation, and testing, and report the mean accuracy of all models on the test sets over 10 random splits. We compare our model with several popular GNNs, including GCN, GAT (velivckovic2017graph), GEOM-GCN (pei2020geom), APPNP (klicpera2018predict), JKNet (xu2018representation), and GCNII (chen2020simple). As shown in Table 4, KerGNNs achieve similar or better results compared with the SOTA baselines.

Cora Citeseer Pubmed Chameleon
GCN 85.77 73.68 88.13 28.18
GAT 86.37 74.32 87.62 42.93
Geom-GCN 85.27 77.99 90.05 60.90
APPNP 87.87 76.53 89.40 54.30
JKNet 87.46 76.83 89.18 62.08
GCNII 88.49 77.08 89.57 60.61
KerGNN 87.96 76.61 89.53 62.28
Table 4: Node classification accuracy.

Appendix G More Visualizations

In this section, we show the graph filters trained on the MUTAG and REDDIT-BINARY datasets in Figures 5 and 6. The trained graph filters exhibit different patterns for the two datasets, and each type of pattern reveals the characteristics of the corresponding dataset. Specifically, graph filters for MUTAG tend to have ring and circular patterns, while graph filters for REDDIT tend to have star patterns.

Figure 5: Graph filters of MUTAG
Figure 6: Graph filters of REDDIT-BINARY