An End-to-End Graph Convolutional Kernel Support Vector Machine

02/29/2020
by Padraig Corcoran, et al.

A novel kernel-based support vector machine (SVM) for graph classification is proposed. The SVM feature space mapping consists of a sequence of graph convolutional layers, which generates a vector space representation for each vertex, followed by a pooling layer which generates a reproducing kernel Hilbert space (RKHS) representation for the graph. The use of a RKHS offers the ability to implicitly operate in this space using a kernel function without the computational complexity of explicitly mapping into it. The proposed model is trained in a supervised end-to-end manner whereby the convolutional layers, the kernel function and SVM parameters are jointly optimized with respect to a regularized classification loss. This approach is distinct from existing kernel-based graph classification models which instead either use feature engineering or unsupervised learning to define the kernel function. Experimental results demonstrate that the proposed model outperforms existing deep learning baseline models on a number of datasets.

1 Introduction

The world contains much implicit structure which can be modelled using a graph. For example, an image can be modelled as a graph where objects (e.g. person, chair) are modelled as vertices and their pairwise relationships (e.g. sitting) are modelled as edges [22]. This representation has led to useful solutions for many vision problems including image captioning and visual question answering [2]. Similarly, a street network can be modelled as a graph where locations are modelled as vertices and street segments are modelled as edges. This representation has led to useful solutions for many transportation problems including the placement of electric vehicle charging stations [8].

Given the ubiquity of problems which can be modelled in terms of graphs, performing machine learning on graphs represents an area of great research interest. Advances in the application of deep learning, or neural networks, to sequence spaces in the context of natural language processing and to fixed dimensional vector spaces in the context of computer vision have led to much interest in applying deep learning to graphs. There exist many types of machine learning tasks one may wish to perform on graphs. These include vertex classification, graph classification, graph generation [46] and learning implicit/hidden structures [7]. In this work we focus on the task of graph classification. Examples of graph classification tasks include human activity recognition where human pose is modelled using a skeleton graph [42], visual scene understanding where the scene is modelled using a scene graph [39] and semantic segmentation of three dimensional point clouds where the point cloud is modelled as a graph of geometrically homogeneous elements [23].

The graph convolutional network is the most commonly used deep learning architecture applied to graphs. This architecture consists of a sequence of convolutional layers where each layer iteratively updates a vector space representation of each vertex. In their seminal work, Gilmer et al. [9] demonstrated that many different convolutional layers can be formulated in terms of a framework containing two steps. In the first step, message passing is performed where each vertex receives messages from adjacent vertices regarding their current representation. In the second step, each vertex performs an update of its representation which is a function of its current representation and the messages it received in the previous step. In order to perform graph classification given a sequence of convolutional layers, the set of vertex representations output from this sequence must be integrated to form a graph representation. This graph representation can subsequently be used to predict a corresponding class label. We refer to this task of integrating vertex representations as vertex pooling and it represents the focus of this article. Note that Gilmer et al. [9] refer to this task as readout.

Performing vertex pooling is made challenging by the fact that different sets of vertex representations corresponding to different graphs may contain different numbers of elements. Furthermore, the elements in a given set are unordered. Therefore one cannot directly apply a feed-forward or recurrent architecture because these require an input lying in a vector space or sequence space respectively. To overcome this challenge most solutions involve mapping the sets of vertex representations to either a vector or sequence space which can then form the input to a feed-forward or recurrent architecture respectively. There exists a wide array of such solutions, ranging from computing simple summary statistics, such as the mean vertex representation, to more complex clustering based methods [44].
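To make the simplest case concrete, the following sketch (with hypothetical vertex representations) shows how mean pooling reduces two differently sized sets of vertex vectors to elements of the same fixed dimensional vector space:

```python
import torch

# Two graphs with different numbers of vertices; each vertex has a
# 4-dimensional representation produced by some convolutional layers.
graph_a = torch.randn(5, 4)   # 5 vertices
graph_b = torch.randn(12, 4)  # 12 vertices

# Mean pooling maps both sets to the same fixed dimensional vector space.
pooled_a = graph_a.mean(dim=0)  # shape (4,)
pooled_b = graph_b.mean(dim=0)  # shape (4,)
```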

In this article we propose a novel binary graph classification method which performs vertex pooling by mapping a set of vertex representations to an element in a reproducing kernel Hilbert space (RKHS). A RKHS is a function space for which there exists a corresponding kernel function equalling the dot product in this space. Because it is a space of functions whose domain is a Euclidean space, the RKHS in question is infinite dimensional and in turn has high model capacity. However, its infinite dimension makes it challenging to work in this space directly. To overcome this challenge, we use the corresponding kernel function which allows us to implicitly compute the dot product in this space without explicitly mapping to the space in question. This is a commonly used strategy known as the kernel trick. More specifically, the kernel corresponding to the RKHS is used within a support vector machine (SVM) to perform binary graph classification. A useful feature of the proposed pooling method is that the mapping to a RKHS is parametrized by a scale parameter which controls the degree to which different sets of vertex representations can be discriminated.

The proposed graph classification model is trained in a supervised end-to-end manner where the convolutional layers, the kernel function and SVM parameters are jointly optimized with respect to a regularized classification loss. This approach is distinct from existing kernel-based models which instead use feature engineering or unsupervised learning to define the kernel function and only optimize the parameters of the classification method in a supervised manner [43]. Using feature engineering can result in diagonal dominance whereby a graph is determined to only be similar to itself, but not to any other graph [43]. Although unsupervised learning can overcome this problem and improve performance, the kernel may not be optimal for the task at hand given it was learned in an unsupervised as opposed to supervised manner [15]. The proposed solution of optimizing in an end-to-end manner overcomes these limitations.

The remainder of this paper is structured as follows. Section 2 reviews related work on graph kernels and vertex pooling methods. Section 3 describes the proposed graph classification model. Section 4 presents an evaluation of this model through comparison to 12 baseline models on 4 datasets. Finally, Section 5 draws some conclusions from this work and discusses possible future research directions.

2 Related Work

In this work we propose a novel vertex pooling method which performs vertex pooling by mapping to a RKHS. In the following two sections we review related work on vertex pooling methods and graph kernels.

2.1 Vertex Pooling

As discussed in the introduction to this article, existing vertex pooling methods generally map the set of vertex representations to a fixed dimensional vector space or sequence space. The simplest methods for performing vertex pooling compute a summary statistic of the set of vertex representations. Commonly used summary statistics include mean, max and sum [4]. Despite the simple nature of these methods, a recent study by Luzhnica et al. [25] demonstrated that in some cases they can outperform more complex methods. Zhang et al. [47] proposed a vertex pooling method which first performs a sorting of vertex representations based on the Weisfeiler-Lehman graph isomorphism algorithm. A subset of these vertex representations is then selected based on this ranking, where the size of this subset is a user specified parameter. Li et al. [24] proposed a vertex pooling method which outputs an element in sequence space. Gilmer et al. [9] proposed to perform vertex pooling by applying the set2set model from Vinyals et al. [36]. The set2set model maps the set of vertex representations to a fixed dimensional vector space representation which is invariant to the order of elements in the set. Ying et al. [44] proposed a vertex pooling method which uses clustering to iteratively integrate vertex representations and outputs an element in a fixed dimensional vector space. Kearnes et al. [16] proposed a vertex pooling method which creates a fuzzy histogram of the vertex representations and outputs an element in a fixed dimensional vector space.

2.2 Graph Kernels

As described in the introduction to this article, existing kernel-based graph classification methods use either feature engineering or unsupervised learning to define the kernel. We now review each of these approaches in turn.

The most common approach for feature engineering kernels is the R-convolution framework where the kernel function of two graphs is defined in terms of the similarity of their respective substructures [13]. This framework is similar to the bag-of-words framework used in natural language processing. Substructures used in the R-convolution framework to define kernels include graphlets [33], shortest path properties [1] and random walk properties [34].

The Weisfeiler-Lehman framework is a framework for feature engineering kernels which is inspired by the Weisfeiler-Lehman test of graph isomorphism. In this framework the vertex representations of a given graph are iteratively updated in a similar manner to graph convolution to give a sequence of graphs. A kernel is then defined with respect to this sequence by summing the application of a given kernel, known as the base kernel, to each graph in the sequence. Shervashidze et al. [32] proposed a family of kernels using this framework by considering a set of base kernels including one which measures the similarity of shortest path properties. Rieck et al. [30] proposed a kernel using this framework by considering a base kernel which measures the similarity of topological properties.

Kriege et al. [21] proposed another framework for feature engineering kernels known as assignment kernels which computes an optimal assignment between graph substructures and sums over a kernel applied to each correspondence in the assignment. The authors proposed a number of kernels using this framework including one based on the Weisfeiler-Lehman graph isomorphism algorithm. Kondor et al. [20] proposed a multiscale kernel which considers vertex features plus topological information through the graph Laplacian. Zhang et al. [48] proposed a kernel based on the return probabilities of random walks. The authors used an approximation of the kernel function so that the method can be applied to large datasets [29].

To overcome the limitations of feature engineering and improve performance, recent works in the field of graph kernels have considered unsupervised learning techniques. These methods generally learn a graph representation in an unsupervised manner and subsequently use this representation to define a kernel. Yanardag et al. [43] proposed a kernel which uses the R-convolution framework to define a set of substructures and subsequently learns an embedding of these substructures in an unsupervised manner using a word2vec type model. Ivanov et al. [15] proposed a kernel which determines two graphs to be similar if their vertices have similar neighbourhoods measured in terms of anonymous walks, which are a generalization of random walks. Learning is performed in an unsupervised manner using a word2vec type model. Nikolentzos et al. [26] proposed a graph kernel which first computes sets of vertex representations corresponding to the graphs in question in an unsupervised manner. The similarity of these sets is then computed using the earth mover's distance. The authors noted that these similarities do not yield a positive semidefinite kernel matrix, preventing it from being used in some kernel-based classification methods. To overcome this issue the authors use a version of the support vector machine for indefinite kernel matrices. Similar to Nikolentzos et al. [26], Wu et al. [37] proposed a graph kernel which first computes sets of vertex representations corresponding to the graphs in question in an unsupervised manner. The resulting set of embeddings is in turn used to embed the graph in question by measuring the disturbance distance to sets of embeddings corresponding to random graphs. Finally, this graph representation is used to define a kernel.

3 Methodology

The proposed graph classification model consists of the following three steps. In the first step, a sequence of graph convolutional layers is applied to the graph in question to generate a corresponding set of vertex representations. In the second step, this set of vertex representations is mapped to a RKHS. In the final step, graph classification is performed using a SVM. Each of these three steps is described in turn in the first three subsections of this section. In the final subsection we describe how the parameters of each step are optimized jointly in an end-to-end manner. Before that, we first introduce some notation and formally define the problem of graph classification.

A graph $G$ is a tuple $(V, E)$ where $V$ is a set of vertices and $E$ is a set of edges. Let $\mathcal{G}$ denote the space of graphs. Let $l \colon V \to \Sigma$ denote a vertex labelling function. In this work we assume that $\Sigma$ is a finite set. Let $\{G_1, \dots, G_n\} \subset \mathcal{G}$ denote a set of graphs and $\{y_1, \dots, y_n\}$ denote a corresponding set of graph labels. In this work we assume that graph labels take elements in the set $\{-1, 1\}$. We consider the problem of binary graph classification where, given these two sets, we wish to learn a map $\mathcal{G} \to \{-1, 1\}$.

3.1 Graph Convolution Layers

A large number of different graph convolutional layers have been proposed. Broadly speaking, a graph convolutional layer updates the representation of each vertex in a given graph where this update is a function of the current representation of that vertex plus the representations of its adjacent neighbours. In this section we only briefly review existing graph convolutional layers, but the interested reader can find a more in-depth analysis in the following review papers [49, 38].

Gilmer et al. [9] showed that many different convolutional layers may be reformulated in terms of a framework called Message Passing Neural Networks, defined in terms of a message function $M_t$ and an update function $U_t$. In this framework vertex representations are updated according to Equation 1, where $h_v^t$ denotes the representation of vertex $v$ output from the $t$-th convolutional layer and $N(v)$ denotes the set of vertices adjacent to $v$. Each vertex representation is an element of $\mathbb{R}^{d_t}$ where the dimension $d_t$ may vary from layer to layer. For the input layer, that is $t = 0$, vertex representations equal a one-hot encoding of the vertex labelling function $l$ and therefore the corresponding dimension is $|\Sigma|$. For all subsequent layers the corresponding dimension is a model hyper-parameter.

$$h_v^{t+1} = U_t\left(h_v^t, \sum_{u \in N(v)} M_t\left(h_v^t, h_u^t\right)\right) \tag{1}$$

In the proposed graph classification model we use the functions $M_t$ and $U_t$ originally proposed by Hamilton et al. [12] and defined in Equation 2. Here CONCAT is the horizontal vector concatenation operation, $W^t$ and $b^t$ are the weights and biases respectively for the $t$-th convolutional layer, and ReLU is the real valued rectified linear unit non-linearity.

$$M_t\left(h_v^t, h_u^t\right) = h_u^t, \qquad U_t\left(h_v^t, m_v^{t+1}\right) = \mathrm{ReLU}\left(W^t \, \mathrm{CONCAT}\left(h_v^t, m_v^{t+1}\right) + b^t\right) \tag{2}$$

A sequence of two convolutional layers was used in the proposed model. A number of studies have found that the use of two layers empirically gives the best performance [19]. This sequence of layers maps a graph to a set of points in $\mathbb{R}^d$ where $d$ is the dimension of the final convolutional layer. Since the number of vertices in a graph may vary, the number of points in this set may in turn vary. Let us denote by $\mathrm{Set}(\mathbb{R}^d)$ the space of sets of points in $\mathbb{R}^d$. Given this, the sequence of convolutional layers defines a map $\mathcal{G} \to \mathrm{Set}(\mathbb{R}^d)$.
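The following is a minimal PyTorch sketch of a convolutional layer of the form in Equation 2. The class name, the dense adjacency matrix and the summation of neighbour messages are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn as nn

class ConvLayer(nn.Module):
    """One message-passing layer of the form in Equation 2 (sketch)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Weights W^t and biases b^t; the input is CONCAT(h_v, aggregated messages).
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        # h:   (num_vertices, in_dim) vertex representations
        # adj: (num_vertices, num_vertices) binary adjacency matrix (float tensor)
        messages = adj @ h                          # sum of neighbour representations
        combined = torch.cat([h, messages], dim=1)  # CONCAT operation
        return torch.relu(self.linear(combined))    # ReLU non-linearity
```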

3.2 Mapping to RKHS

The output from the sequence of convolutional layers defined in the previous subsection is an element in the space $\mathrm{Set}(\mathbb{R}^d)$. In this section we propose a method for mapping elements in this space to a reproducing kernel Hilbert space (RKHS). We in turn define a kernel between elements in this space.

A Hilbert space is a vector space with an inner product such that the induced norm turns the space into a complete metric space. A positive-semidefinite kernel on a set $\mathcal{X}$ is a function $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for which there exists a feature space $\mathcal{H}$ and a map $\phi \colon \mathcal{X} \to \mathcal{H}$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$ where $x, y \in \mathcal{X}$ and $\langle \cdot, \cdot \rangle$ denotes the dot product in $\mathcal{H}$. Equivalently, a function $k$ is a kernel if and only if for every finite subset $\{x_1, \dots, x_m\} \subseteq \mathcal{X}$ the matrix with entries $k(x_i, x_j)$ is positive semi-definite. Given a kernel $k$, one can define a map $\phi$ as in Equation 3, where the codomain of this map is the space of real valued functions on $\mathcal{X}$. Such a space is called a function space. Given this, it can be proven that $k(x, y) = \langle \phi(x), \phi(y) \rangle$. By virtue of this property the corresponding function space is called a reproducing kernel Hilbert space (RKHS) of the kernel $k$ [31].

$$\phi(x) = k(\cdot, x) \tag{3}$$

Let $k_\sigma \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be the Gaussian kernel function defined in Equation 4 where $x, y \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}_{>0}$.

$$k_\sigma(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right) \tag{4}$$

Given $k_\sigma$, we define a map $\Phi_\sigma \colon \mathrm{Set}(\mathbb{R}^d) \to \mathcal{H}$ in Equation 5, where $\mathcal{H}$ is the space of real valued functions on $\mathbb{R}^d$. To illustrate this map, consider the element $S$ of $\mathrm{Set}(\mathbb{R}^d)$ displayed in Figure 1(a), where the dimension $d$ equals 2. Recall that elements in the space $\mathrm{Set}(\mathbb{R}^d)$ correspond to sets of points in $\mathbb{R}^d$. Figures 1(b) and 1(c) display the elements of $\mathcal{H}$ resulting from applying the map $\Phi_\sigma$ to this element with two different values of the parameter $\sigma$.

$$\Phi_\sigma(S) = \sum_{x \in S} k_\sigma(\cdot, x) \tag{5}$$

The parameter $\sigma$ of the map $\Phi_\sigma$ is a scale parameter and may be interpreted as follows. As the value of $\sigma$ approaches $0$, $\Phi_\sigma(S)$ approaches a sum of indicator functions centred on the elements of $S$. In this case distinct elements of the space $\mathrm{Set}(\mathbb{R}^d)$ map to distinct elements of $\mathcal{H}$, where the distance between these functions measured by the induced norm is greater than zero. On the other hand, as $\sigma$ approaches $\infty$, differences between the functions are gradually smoothed out and in turn the distance between the functions gradually reduces. Therefore, one can view the parameter $\sigma$ as controlling the discrimination power of the method.

Figure 1: An element of the space $\mathrm{Set}(\mathbb{R}^d)$ is displayed in (a), where the dimension $d$ equals 2 and each point is represented by a red dot. The elements of $\mathcal{H}$ resulting from applying the map $\Phi_\sigma$ to this element with two different values of $\sigma$ are displayed in (b) and (c) respectively.

Given the map $\Phi_\sigma$ defined in Equation 5, we define the kernel $K_\sigma$ in Equation 6. Note that the final equality in this equation follows from the reproducing property of the RKHS related to $k_\sigma$ and the bilinearity of the inner product [28]. By examination of Equation 6, we see that the kernel $K_\sigma$ equals the dot product between elements in the codomain of the map $\Phi_\sigma$, which is an infinite dimensional function space. That is, the kernel allows us to operate in this codomain without the computational complexity of explicitly mapping into it.

$$K_\sigma(S, S') = \langle \Phi_\sigma(S), \Phi_\sigma(S') \rangle = \sum_{x \in S} \sum_{y \in S'} k_\sigma(x, y) \tag{6}$$
Theorem 1.

The kernel $K_\sigma$ is a positive-semidefinite kernel.

Proof.

The kernel $K_\sigma$ is a positive-semidefinite kernel because it is defined in Equation 6 to equal the dot product in the space $\mathcal{H}$. ∎
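Read directly from Equation 6, the kernel between two sets of vertex representations is a sum of pairwise Gaussian kernel evaluations. The following naive sketch (with illustrative function names) makes this explicit; a vectorized version is discussed in Section 4.1.

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel k_sigma(x, y) from Equation 4."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def set_kernel(S, T, sigma):
    """K_sigma(S, T) from Equation 6: sum of pairwise kernel evaluations."""
    return sum(gaussian_kernel(x, y, sigma) for x in S for y in T)
```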

The kernel $K_\sigma$ has a specific scale which is specified by $\sigma$. In order to adopt a multi-scale approach we consider a set of scales $\{\sigma_1, \dots, \sigma_m\}$ to define a corresponding set of kernels $\{K_{\sigma_1}, \dots, K_{\sigma_m}\}$. We combine these kernels using a linear combination, defined in Equation 7, where each $\beta_i \in \mathbb{R}_{>0}$. Let $\mathcal{H}_K$ denote the reproducing kernel Hilbert space (RKHS) corresponding to the kernel $K$ [11].

$$K(S, S') = \sum_{i=1}^{m} \beta_i K_{\sigma_i}(S, S') \tag{7}$$
Theorem 2.

The kernel $K$ is a positive-semidefinite kernel.

Proof.

The kernel $K$ is a positive-semidefinite kernel because it is a sum of positive-semidefinite kernels whose coefficients $\beta_i$ are all positive (see Proposition 13.1 in [31]). ∎
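A corresponding sketch of the multi-scale combination in Equation 7, assuming the single-scale set_kernel sketched above:

```python
def multiscale_kernel(S, T, sigmas, betas):
    """K(S, T) from Equation 7: positive linear combination of single-scale kernels."""
    return sum(beta * set_kernel(S, T, sigma)
               for sigma, beta in zip(sigmas, betas))
```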

3.3 SVM

Recall that we consider the problem of binary graph classification whereby, given $\{G_1, \dots, G_n\}$ and $\{y_1, \dots, y_n\}$, we wish to learn a map $\mathcal{G} \to \{-1, 1\}$.

Let $f \colon \mathrm{Set}(\mathbb{R}^d) \to \mathbb{R}$ be a map from which we obtain a decision function by $\mathrm{sgn}(f)$. That is, if $f$ returns a positive value we classify the graph in question as $1$ and otherwise we classify it as $-1$. We determine a suitable map $f$ lying in the RKHS $\mathcal{H}_K$ corresponding to the kernel $K$ by Equation 8, where $S_i \in \mathrm{Set}(\mathbb{R}^d)$ denotes the set of vertex representations corresponding to graph $G_i$ and $\lambda > 0$ is a regularization hyper-parameter. Note that the first term in this sum corresponds to the soft margin loss [31] and the second term is a regularization term.

$$f^{*} = \operatorname*{arg\,min}_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i f(S_i)\right) + \lambda \lVert f \rVert_{\mathcal{H}_K}^{2} \tag{8}$$

By the representer theorem, any solution to Equation 8 can be written in the form of Equation 9 where $\alpha_i \in \mathbb{R}$ [28].

$$f(\cdot) = \sum_{i=1}^{n} \alpha_i K(\cdot, S_i) \tag{9}$$

Substituting this into Equation 8 we obtain Equation 10, where optimization of the objective is performed with respect to $\alpha \in \mathbb{R}^{n}$. Here $\mathbf{K}$ is the $n \times n$ kernel matrix with entries $\mathbf{K}_{ij} = K(S_i, S_j)$, $\odot$ is the elementwise multiplication operator (Hadamard product), $\mathbf{0}$ is a vector of zeros of size $n$, $\mathbf{1}$ is a vector of ones of size $n$, $y = (y_1, \dots, y_n)^{\top}$ and the max is applied elementwise.

$$\alpha^{*} = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^{n}} \; \frac{1}{n} \mathbf{1}^{\top} \max\left(\mathbf{0}, \mathbf{1} - y \odot \mathbf{K}\alpha\right) + \lambda \, \alpha^{\top} \mathbf{K} \alpha \tag{10}$$
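The objective in Equation 10 can be written as a differentiable function of the kernel matrix and the coefficient vector, which is what permits the end-to-end optimization described in the next subsection. The following PyTorch sketch is illustrative; the name lam stands for the regularization weight and is an assumed notation.

```python
import torch

def svm_objective(K, y, alpha, lam):
    """Regularized hinge-loss objective of Equation 10 (sketch).

    K:     (n, n) kernel matrix with entries K(S_i, S_j)
    y:     (n,) labels in {-1, +1} as a float tensor
    alpha: (n,) SVM coefficients from Equation 9
    lam:   regularization weight
    """
    f = K @ alpha                                      # f(S_i) for every training graph
    hinge = torch.clamp(1.0 - y * f, min=0.0).mean()   # elementwise soft-margin loss
    reg = lam * alpha @ (K @ alpha)                    # ||f||^2 in the RKHS
    return hinge + reg
```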

3.4 End-to-End Optimization

As described in the previous subsections, the proposed classification model contains three steps, each having corresponding parameters which require optimization with respect to the objective function defined in Equation 10. The parameters in question are the convolutional layer weights and biases $W^t$ and $b^t$ defined in Equation 2, the kernel parameters $\sigma_i$ and $\beta_i$ defined in Equation 7, and the SVM parameters $\alpha$ defined in Equation 9. All of these parameters are unconstrained real values apart from the kernel parameters $\sigma_i$ and $\beta_i$, which are constrained to be positive real values. As such, the optimization problem in question is a constrained optimization problem. In this work we wish to optimize all the above model parameters jointly in an end-to-end manner. We refer to this as the end-to-end optimization problem. Note that if only the SVM parameters were optimized and all other parameters were fixed, the optimization problem could be formulated as a quadratic program by taking the dual and solved using a standard quadratic programming solver [31]. This is the most commonly used method for optimizing the parameters of an SVM.

In order to solve the end-to-end optimization problem we use a gradient based optimization method. Such methods are the most commonly used methods for optimizing neural network parameters [10]. There are two main approaches that can be used to apply a gradient based optimization method to a constrained optimization problem. The first approach is to project the result of each gradient step back into the feasible region. The second approach is to transform the constrained optimization problem into an unconstrained optimization problem and solve this problem; such a transformation can be achieved using the Karush-Kuhn-Tucker (KKT) method [27]. In this work we use the former approach. In practice this reduces to mapping the parameters $\sigma_i$ and $\beta_i$ back onto the positive reals after each gradient step. The above optimization can be used in conjunction with any gradient based optimization method such as stochastic gradient descent. In this work the Adam method was used [18].
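A sketch of this projection approach is given below; the parameter values, the clamping threshold and the use of in-place clamping as the projection are illustrative assumptions.

```python
import torch

# Constrained kernel parameters (illustrative initial values).
sigmas = torch.tensor([0.5, 1.0, 2.0], requires_grad=True)
betas = torch.ones(3, requires_grad=True)
optimizer = torch.optim.Adam([sigmas, betas], lr=1e-3)

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Project the constrained parameters back onto the positive reals.
    with torch.no_grad():
        sigmas.clamp_(min=1e-6)
        betas.clamp_(min=1e-6)
```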

4 Evaluation

In this section we present an evaluation of the proposed end-to-end graph classification model with respect to current state-of-the-art models. This section is structured as follows. Section 4.1 provides implementation details for the proposed model. Section 4.2 describes the baseline models against which the proposed model is compared. Finally, Section 4.3 describes the datasets used in this evaluation and compares the performance of all models on these datasets.

4.1 Implementation Details

The parameters of the proposed model were initialized as follows. The convolutional layer weights $W^t$ in Equation 2 were initialized using Kaiming initialization [14] and the biases $b^t$ were initialized to a constant value. The kernel parameters $\sigma_i$ and $\beta_i$ in Equation 7 were all initialized to a common constant value.

The model hyper-parameters were set as follows. The dimension of the convolutional hidden layers was set to a fixed value. The Adam optimizer learning rate was set to its default value of 0.001 and training was performed for 300 epochs. The hyper-parameters in Equation 10 and Equation 7 were selected from candidate sets by considering classification accuracy on a validation set.

The proposed classification model was implemented in Python 3 using PyTorch. All experiments were run on an Nvidia GeForce RTX 2080 GPU. As can be observed from Equations 6 and 7, the proposed kernel function reduces to a series of summations. If this were naively implemented using a series of Python for loops it would result in slow learning and inference. To overcome this issue we performed vectorization, whereby the kernel was implemented using a series of PyTorch tensor operations. These tensor operations are implemented in C++ as opposed to Python and therefore result in significantly faster learning and inference. The authors will make this code available upon publication of this paper.
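As an illustration of this vectorization, the single-scale kernel of Equation 6 (sketched naively in Section 3.2) can be computed with tensor operations as follows; torch.cdist evaluates all pairwise Euclidean distances in a single call.

```python
import torch

def set_kernel_vectorized(S, T, sigma):
    """K_sigma(S, T) from Equation 6 using tensor operations (sketch).

    S: (n, d) tensor of vertex representations for one graph
    T: (m, d) tensor of vertex representations for another graph
    """
    sq_dists = torch.cdist(S, T) ** 2                  # (n, m) pairwise squared distances
    return torch.exp(-sq_dists / (2 * sigma ** 2)).sum()
```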

The time and space complexity of classifying a given graph is $O(n)$ where $n$ is the number of graphs in the training dataset. This is a consequence of the summation in Equation 9 over all training examples. The time and space complexity of performing an update of the method parameters using backpropagation is $O(n^2)$ because this step computes the complete kernel matrix in Equation 10.

4.2 Baseline Methods

As described in the related work section of this paper, existing models for graph classification belong to two main categories: feature engineered kernel models and end-to-end deep learning models. Recent studies have found that the latter category of models outperforms the former [44]. Therefore, for the purposes of this evaluation, we only considered end-to-end deep learning models.

A total of 12 baseline methods were considered. We considered this many baseline methods to ensure we were comparing against the state of the art; many existing methods claim to outperform one another, so it is difficult to determine which methods are in fact state of the art. The baseline methods considered in the evaluation are end-to-end methods but not kernel-based methods. The proposed method is the first end-to-end kernel-based method for graphs.

Implementations of these baselines in PyTorch were obtained from the PyTorch Geometric Python library [6] and can be downloaded directly from the benchmark section of the PyTorch Geometric website (https://github.com/rusty1s/pytorch_geometric/tree/master/benchmark). Model parameters were optimized using the Adam optimizer with its default learning rate of 0.001 and training was performed for 300 epochs. For all baseline models a negative log likelihood loss function was used. Model hyper-parameters corresponding to the number and dimension of hidden layers were selected from candidate sets by considering the loss on a validation set.

Model               MUTAG        PTC_MR       BZR_MD       PTC_FM
GCN                 0.73 ± 0.06  0.57 ± 0.03  0.68 ± 0.06  0.61 ± 0.08
GCNWithJK           0.73 ± 0.07  0.57 ± 0.05  0.68 ± 0.07  0.62 ± 0.08
GIN                 0.82 ± 0.07  0.54 ± 0.05  0.62 ± 0.09  0.57 ± 0.07
GIN0                0.85 ± 0.04  0.57 ± 0.08  0.63 ± 0.13  0.59 ± 0.05
GINWithJK           0.83 ± 0.07  0.55 ± 0.07  0.61 ± 0.15  0.60 ± 0.04
GIN0WithJK          0.83 ± 0.06  0.54 ± 0.07  0.63 ± 0.10  0.59 ± 0.06
GraphSAGE           0.72 ± 0.06  0.57 ± 0.08  0.68 ± 0.09  0.61 ± 0.06
GraphSAGEWithJK     0.71 ± 0.09  0.56 ± 0.04  0.68 ± 0.09  0.60 ± 0.07
DiffPool            0.84 ± 0.12  0.57 ± 0.03  0.69 ± 0.07  0.61 ± 0.06
GlobalAttentionNet  0.74 ± 0.07  0.56 ± 0.05  0.67 ± 0.06  0.63 ± 0.06
Set2SetNet          0.73 ± 0.07  0.56 ± 0.03  0.68 ± 0.10  0.62 ± 0.06
SortPool            0.75 ± 0.11  0.59 ± 0.08  0.66 ± 0.09  0.60 ± 0.08
Proposed Model      0.87 ± 0.06  0.60 ± 0.08  0.62 ± 0.10  0.62 ± 0.06
Table 1:

For each dataset, the mean classification accuracy plus standard deviation of 10-fold cross validation for each graph classification model are displayed.

We now briefly describe the architectures of the 12 baseline models; specific implementation details can be found at the benchmark section of the PyTorch Geometric website:

GCN - This model consists of the graph convolutional layers proposed by Kipf et al. [19], followed by mean pooling, followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

GCNWithJK - This model is equal to GCN but with the addition of jump or skip connections before mean pooling as proposed by Xu et al. [41].

GIN - This model consists of the graph convolutional layers proposed by Xu et al. [40], followed by mean pooling, followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer. The convolutional layer in question has a parameter $\epsilon$ which is learned.

GIN0 - This model is equal to GIN with the exception that the parameter $\epsilon$ is not learned and instead is set to a value of $0$.

GINWithJK - This model is equal to GIN but with the addition of jump or skip connections before mean pooling as proposed by Xu et al. [41].

GIN0WithJK - This model is equal to GIN0 but with the addition of jump or skip connections before mean pooling as proposed by Xu et al. [41].

GraphSAGE - This model consists of the graph convolutional layers proposed by Hamilton et al. [12], followed by a mean pooling layer, followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

GraphSAGEWithJK - This model is equal to GraphSAGE but with the addition of jump or skip connections before mean pooling as proposed by Xu et al. [41].

DiffPool - This model consists of the graph convolutional layers proposed by Hamilton et al. [12], followed by the pooling method proposed by Ying et al. [44], followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

GlobalAttentionNet - This model consists of the graph convolutional layers proposed by Hamilton et al. [12], followed by the pooling layer proposed by Li et al. [24], followed by a dropout layer, followed by a non-linear layer, followed by a linear layer, followed by a softmax layer.

Set2SetNet - This model consists of the graph convolutional layers proposed by Hamilton et al. [12], followed by the pooling layer proposed by Vinyals et al. [36], followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

SortPool - This model consists of the graph convolutional layers proposed by Hamilton et al. [12], followed by the pooling layer proposed by Zhang et al. [47], followed by a non-linear layer, followed by a dropout layer, followed by linear layer, followed by a softmax layer.

4.3 Datasets and Results

To evaluate the proposed graph classification model we considered four commonly used graph classification datasets obtained from the TU Dortmund University graph dataset repository [17] (https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets). In each dataset the graphs correspond to chemical compounds and the vertices of a graph are labelled with one of a number of distinct types. The first dataset is the MUTAG dataset, which contains 188 graphs. The classification problem is binary and concerns predicting a particular characteristic of the chemical [3]. The second dataset is the PTC_MR dataset, which contains 344 graphs. The classification problem is binary and concerns predicting a carcinogenicity property. The third dataset is the BZR_MD dataset, which contains 306 graphs. The classification problem is binary and concerns predicting a particular characteristic of the chemical [35]. The final dataset is the PTC_FM dataset, which contains 349 graphs. The classification problem is binary and concerns predicting a carcinogenicity property.
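These datasets can also be obtained programmatically. A sketch using the TUDataset loader from the PyTorch Geometric library (already used for the baselines) is given below; the root directory is illustrative.

```python
from torch_geometric.datasets import TUDataset

# Hypothetical local cache directory; the loader downloads each dataset on first use.
for name in ["MUTAG", "PTC_MR", "BZR_MD", "PTC_FM"]:
    dataset = TUDataset(root="data/TUDataset", name=name)
    print(name, len(dataset), "graphs,", dataset.num_classes, "classes")
```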

Stratified $k$-fold cross-validation with $k = 10$ was used to split the data into training and testing sets. One of the folds in the training set was randomly selected to be a validation set and classification accuracy on this set was used to select model hyper-parameters. The same training, testing and validation splits were used for all graph classification models considered. This is an important point because the performance of a given model may vary as a function of the split used. For each dataset we computed the mean accuracy on the test sets for each method. The results of this analysis are displayed in Table 1. For two of the four datasets, the proposed graph classification model outperformed all baseline methods. In fact, on the MUTAG dataset the proposed model outperformed all baseline methods by a significant margin. For the remaining two datasets, the proposed method outperformed many but not all baseline methods. These positive results demonstrate the utility of the proposed model.
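A sketch of this splitting protocol, using scikit-learn's StratifiedKFold for illustration (not necessarily the implementation used in this work):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Graph labels collected from the loader sketched above (assumed integer classes).
y = np.array([int(data.y) for data in dataset])

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(np.zeros((len(y), 1)), y):
    # One fold of the training split is held out as a validation set
    # for hyper-parameter selection; the rest is used for training.
    val_size = len(train_idx) // 10
    val_idx, train_idx = train_idx[:val_size], train_idx[val_size:]
```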

It is important to note that the proposed method was compared against a large number of benchmark methods (12). This makes it challenging for any single method to perform best on all datasets. It is difficult to interpret exactly why one deep learning architecture performs better or worse than another on a particular dataset. However, one limitation of the proposed method that may limit its ability to accurately discriminate is that it only models the distribution of node embeddings and not the positions of these nodes in the graph. The recent work by You et al. [45] suggests position information is important. The DiffPool method, which performed best on the BZR_MD dataset, actually uses node position information when performing clustering in the pooling step (this is illustrated in Figure 1 of the original paper by Ying et al. [44]). We hypothesize that position information may not be important for some graph classification tasks while being important for others. This may explain why the proposed method does not uniformly outperform all others. It is also worth noting that the proposed method achieved similar performance to the GIN method on the BZR_MD dataset. In a recent paper by Errica et al. [5], the authors found the GIN method to achieve the best results on a number of datasets.

5 Conclusions and Future Work

This article proposes a novel kernel-based support vector machine (SVM) for graph classification. Unlike existing kernel-based models, the proposed model is trained in a supervised end-to-end manner whereby the convolutional layers, the kernel function and SVM parameters are jointly optimized. The proposed model outperforms existing deep learning models on a number of datasets which demonstrates the utility of the model.

Despite these positive results, the proposed model is not a suitable candidate solution for all graph classification problems. Like all kernel-based models, the proposed model does not natively scale to large datasets. This is a consequence of the fact that training the model requires computing and storing the kernel matrix, whose size is quadratic in the number of training examples. This limitation may potentially be overcome by performing an approximation of the kernel function [29]. The authors plan to investigate this research direction in future work.

References

  • [1] K. M. Borgwardt and H. Kriegel (2005) Shortest-path kernels on graphs. In IEEE International Conference on Data Mining, 8 pp. Cited by: §2.2.
  • [2] V. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-Fei (2019) Scene graph prediction with limited labels. In International Conference on Computer Vision, Cited by: §1.
  • [3] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch (1991) Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal chemistry 34 (2), pp. 786–797. Cited by: §4.3.
  • [4] D. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §2.1.
  • [5] F. Errica, M. Podda, D. Bacciu, and A. Micheli (2019) A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893. Cited by: §4.3.
  • [6] M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §4.2.
  • [7] L. Franceschi, M. Niepert, M. Pontil, and X. He (2019) Learning discrete structures for graph neural networks. In International Conference on Machine Learning, pp. 1972–1982. Cited by: §1.
  • [8] A. Gagarin and P. Corcoran (2018) Multiple domination models for placement of electric vehicle charging stations in road networks. Computers & Operations Research 96, pp. 69–79. Cited by: §1.
  • [9] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. Cited by: §1, §2.1, §3.1.
  • [10] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §3.4.
  • [11] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §3.2.
  • [12] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §3.1, §4.2, §4.2, §4.2, §4.2, §4.2.
  • [13] D. Haussler (1999) Convolution kernels on discrete structures. Technical report Cited by: §2.2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: §4.1.
  • [15] S. Ivanov and E. Burnaev (2018) Anonymous walk embeddings. In International Conference on Machine Learning, Vol. 80, Stockholmsmässan, Stockholm, Sweden, pp. 2191–2200. Cited by: §1, §2.2.
  • [16] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley (2016) Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design 30 (8), pp. 595–608. Cited by: §2.1.
  • [17] K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann (2016) Benchmark data sets for graph kernels. Cited by: §4.3.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
  • [19] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, Cited by: §3.1, §4.2.
  • [20] R. Kondor and H. Pan (2016) The multiscale laplacian graph kernel. In Advances in Neural Information Processing Systems, pp. 2990–2998. Cited by: §2.2.
  • [21] N. M. Kriege, P. Giscard, and R. Wilson (2016) On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pp. 1623–1631. Cited by: §2.2.
  • [22] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §1.
  • [23] L. Landrieu and M. Simonovsky (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [24] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. In International Conference on Learning Representations, Cited by: §2.1, §4.2.
  • [25] E. Luzhnica, B. Day, and P. Liò (2019) On graph classification networks, datasets and baselines. In ICML Workshop on Learning and Reasoning with Graph-Structured Representations, Cited by: §2.1.
  • [26] G. Nikolentzos, P. Meladianos, and M. Vazirgiannis (2017) Matching node embeddings for graph similarity. In AAAI Conference on Artificial Intelligence, Cited by: §2.2.
  • [27] J. Nocedal and S. Wright (2006) Numerical optimization. Springer Science & Business Media. Cited by: §3.4.
  • [28] V. I. Paulsen and M. Raghupathi (2016) An introduction to the theory of reproducing kernel hilbert spaces. Vol. 152, Cambridge University Press. Cited by: §3.2, §3.3.
  • [29] A. Rahimi and B. Recht (2008) Random features for large-scale kernel machines. In Advances in neural information processing systems, pp. 1177–1184. Cited by: §2.2, §5.
  • [30] B. Rieck, C. Bock, and K. Borgwardt (2019) A persistent weisfeiler-lehman procedure for graph classification. In International Conference on Machine Learning, pp. 5448–5458. Cited by: §2.2.
  • [31] B. Schölkopf and A. J. Smola (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. Cited by: §3.2, §3.2, §3.3, §3.4.
  • [32] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §2.2.
  • [33] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt (2009) Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495. Cited by: §2.2.
  • [34] M. Sugiyama and K. Borgwardt (2015) Halting in random walk kernels. In Advances in neural information processing systems, pp. 1639–1647. Cited by: §2.2.
  • [35] J. J. Sutherland, L. A. O'Brien, and D. F. Weaver (2003) Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships. Journal of Chemical Information and Computer Sciences 43 (6), pp. 1906–1915. Cited by: §4.3.
  • [36] O. Vinyals, S. Bengio, and M. Kudlur (2016) Order matters: sequence to sequence for sets. In International Conference on Learning Representations, San Juan, Puerto Rico. Cited by: §2.1, §4.2.
  • [37] L. Wu, I. E. Yen, Z. Zhang, K. Xu, L. Zhao, X. Peng, Y. Xia, and C. Aggarwal (2019) Scalable global alignment graph kernel using random features: from node embedding to graph embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1418–1428. Cited by: §2.2.
  • [38] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §3.1.
  • [39] D. Xu, Y. Zhu, C. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [40] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations, Cited by: §4.2.
  • [41] K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, Vol. 80, Stockholmsmässan, Stockholm, Sweden, pp. 5453–5462. Cited by: §4.2, §4.2, §4.2, §4.2.
  • [42] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI Conference on Artificial Intelligence, Cited by: §1.
  • [43] P. Yanardag and S. Vishwanathan (2015) Deep graph kernels. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. Cited by: §1, §2.2.
  • [44] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810. Cited by: §1, §2.1, §4.2, §4.2, §4.3.
  • [45] J. You, R. Ying, and J. Leskovec (2019) Position-aware graph neural networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 7134–7143. Cited by: §4.3.
  • [46] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec (2018) GraphRNN: generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, pp. 5694–5703. Cited by: §1.
  • [47] M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence, Cited by: §2.1, §4.2.
  • [48] Z. Zhang, M. Wang, Y. Xiang, Y. Huang, and A. Nehorai (2018) Retgk: graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems, pp. 3964–3974. Cited by: §2.2.
  • [49] Z. Zhang, P. Cui, and W. Zhu (2018) Deep learning on graphs: a survey. arXiv preprint arXiv:1812.04202. Cited by: §3.1.