1. Introduction
Graph kernels are one of the most important methods for graph data analysis and have been successfully applied in diverse fields such as disease and brain analysis (Mokhtari and HosseinZadeh, 2013; Chen and Wu, 2017), chemical analysis (Ralaivola et al., 2005), image action recognition and scene modeling (Wang and Sahbi, 2013; Fisher et al., 2011), and malware analysis(Wagner et al., 2009). Since there are no explicit features in graphs, a kernel function corresponding to a highdimensional feature space provides a flexible way to represent each graph and to compute similarities between them. Hence, much effort has been devoted to designing feature spaces or kernel functions for capturing similarities between structural properties of graphs.
The first line of research focuses on local patterns within graphs (Gärtner et al., 2003; Shervashidze and Borgwardt, 2009). Specifically, these kernels recursively decompose the graphs into small substructures, and then define a feature map over these substructures for the resulting graph kernel. Conceptually, these notable graph kernels can be viewed as instances of a general kernellearning framework called Rconvolution for discrete objects (Haussler, 1999; Shervashidze et al., 2011). However, the aforementioned approaches consider only local patterns rather than global properties, which may substantially limit effectiveness in some applications, depending on the underlying structure of graphs. Equally importantly, most of these graph kernels scale poorly to large graphs due to their at least quadratic time complexity in terms of the number of graphs and cubic time complexity in terms of the size of graphs.
Another family of methods use geometric embeddings of graph nodes to capture global properties, which has shown great promise, achieving stateoftheart performance in graph classification (Johansson et al., 2014; Johansson and Dubhashi, 2015; Nikolentzos et al., 2017). However, these global graph kernels based on matching node embeddings between graphs may suffer from the loss of positive definiteness. Furthermore, the majority of these approaches have at least quadratic complexity in terms of either the number of graph samples or the size of the graph.
To address these limitations of existing graph kernels, we propose a new family of global graph kernels that take into account the global properties of graphs, based on recent advances in the distance kernel learning framework (Wu et al., 2018b). The proposed kernels are truly positivedefinite (p.d.) kernels constructed from a random feature map given by a transportation distance between a set of geometric node embeddings of raw graphs and those of random graphs sampled from a distribution. In particular, we make full use of the wellknown Earth Mover’s Distance (EMD), computing the minimum cost of transporting a set of node embeddings of raw graphs to the ones of random graphs. To yield an efficient computation of the kernel, we derive a Random Features (RF) approximation using a limited number of random graphs drawn from either dataindependent or datadependent distributions. The methods used to generate highquality random graphs have a significant impact on graph learning. We propose two different sampling strategies depending on whether we use node label information or not. Furthermore, we note that each building block in this paper  geometric node embeddings and EMD  can be replaced by other node embeddings methods (Johansson and Dubhashi, 2015; Zhang et al., 2018) and transportation distances (Solomon et al., 2016). Our code is available at https://github.com/IBM/RandomGraphEmbeddings.
We highlight the main contributions as follows:

We propose a class of p.d. global alignment graph kernels based on their global properties derived from geometric node embeddings and the corresponding node transportation.

We present Random Graph Embeddings (RGE), a byproduct of the RF approximation, which yields an expressive graph embedding. Based on this graph embedding, we significantly reduce computational complexity at least from quadratic to (quasi)linear in both the number and the size of the graphs.

We theoretically show the uniform convergence of RGE. We prove that given random graphs, the inner product of RGE can uniformly approximates the corresponding exact graph kernel within
precision, with high probability.

Our experimental results on nine benchmark datasets demonstrate that RGE outperforms or matches twelve stateoftheart graph classification algorithms including graph kernels and deep graph neural networks. In addition, we numerically show that RGE can achieve (quasi)linear scalability with respect to both the number and the size of graphs.
2. Related Work
In this section, we first make a brief survey of the existing graph kernels and then detail the difference between conventional random features method for vector inputs
(Rahimi and Recht, 2008) and our random features method for structured inputs.2.1. Graph Kernels
Generally speaking, we can categorize the existing graph kernels into two groups: kernels based on local substructures, and kernels based on global properties.
The first group of graph kernels compare substructures of graphs, following a general kernellearning framework, i.e., Rconvolution for discrete objects (Haussler, 1999). The major difference among these graph kernels is rooted in how they define and explore substructures to define a graph kernel, including random walks (Gärtner et al., 2003), shortest paths (Borgwardt and Kriegel, 2005), cycles (Horváth et al., 2004), subtree patterns (Shervashidze and Borgwardt, 2009), and graphlets (Shervashidze et al., 2009). A thread of research attempts to utilize node label information using the WeisfeilerLeman (WL) test of isomorphism (Shervashidze et al., 2011) and takes structural similarity between substructures into account (Yanardag and Vishwanathan, 2015b, a) to further improve the performance of kernels.
Recently, a new class of graph kernels, which focus on the use of geometric node embeddings of graph to capture global properties, are proposed. These kernels have achieved stateoftheart performance in the graph classification task (Johansson et al., 2014; Johansson and Dubhashi, 2015; Nikolentzos et al., 2017). The first global kernel was based on the Lovász number (Lovász, 1979) and its associated orthonormal representation (Johansson et al., 2014). However, these kernels can only be applied on unlabelled graphs. Later approaches directly learn graph embeddings by using landmarks (Johansson and Dubhashi, 2015) or compute a similarity matrix (Nikolentzos et al., 2017) by exploiting different matching schemes between geometric embeddings of nodes of a pair of graphs. Unfortunately, the resulting kernel matrix does not yield a valid p.d. kernel and thus delivers a serious blow to hopes of using kernel support machine. Two recent graph kernels, the multiscale laplacian kernel (Kondor and Pan, 2016) and optimal assignment kernel (Kriege et al., 2016) were developed to overcome these limitations by building a p.d. kernel between node distributions or histogram intersection.
However, most of existing kernels only focus on learning kernel matrix for graphs instead of graphlevel representation, which can only be used for graph classification rather than other graph related tasks (e.g., graph matching). More importantly, how to align the nodes in two graphs plays a central role in learning a similarity score. In this paper, we rely on an optimal transportation distance (e.g., Earch Mover’s Distance) to learn the alignment between corresponding nodes that have similar structural roles in graphs, and directly generate a graphlevel representation (embedding) for each graph instead of explicitly computing a kernel matrix.
2.2. Random Features for Kernel Machines
Over the last decade, the most popular approaches for scaling up kernel method is arguably random features approximation and its fruitful variants (Rahimi and Recht, 2008; Sinha and Duchi, 2016; Bach, 2017; Wu et al., 2018a). Given a predefined kernel function, the inner product of RF directly approximates the exact kernel via sampling from a distribution, which leads to a fast linear method for computing kernel based on the learned lowdimensional feature representation. However, these RF approximation methods can only be applied to the shiftinvariant kernels (e.g., the Gaussian or Laplacian kernels) with vectorform input data. Since a graph is a complex object, the developed graph kernels are neither shiftinvariant kernels nor with vectorform inputs. Due to these challenges, to the best of our knowledge, there are no existing studies on how to develop the RF approximation for graph kernels.
A recent work, called D2KE (distances to kernels and embeddings) (Wu et al., 2018b), proposes the general methodology of the derivation of a positivedefinite kernel through a RF map from any given distance function, which enjoys better theoretical guarantees than other distancebased methods. In (Wu et al., 2018c), D2KE was extended to design a specialized timeseries embedding and showed the strong empirical performance for timeseries classification and clustering. We believe there is no work on applying D2KE to the graph kernel domain ^{1}^{1}1Upon acceptance of this paper, a parallel work (AlRfou et al., 2019) also adopted D2KE to develop an unsupervised neural network model for learning graphlevel embedding.. Our work is the first one to build effective and scalable global graph kernels using Random Features.
3. Geometric Embeddings of Graphs and Earth Mover’s Distance
In this section, we will introduce two important building blocks of our method, the geometric node embeddings that are used to represent a graph as a bagofvectors, and the wellknown transportation distance EMD.
3.1. Geometric Embeddings of Graphs
The following notation will be used throughout the paper. Let a graph consisting of nodes, edges, and discrete node labels be represented as a triplet , where is the set of vertices, is the set of undirected edges, and is a function that assigns the label information to nodes from an alphabet set . In this paper, we will consider both unlabeled graphs and graphs with discrete node labels. Let be a set of graphs where and let be a set of graph labels ^{2}^{2}2Note that there are two types of labels involved in our paper, i.e., the node labels and the graph labels. The node labels characterize the property of nodes. The graph labels are the classes that graph belongs to. corresponding to each graph in where . Let the geometric embeddings of a graph be a set of vectors for all nodes, where the vector in is the representation of the node , and is the size of latent node embedding space.
Typically, with different underlying learning tasks, a graph can be characterized by different forms of matrices. Without loss of generality, we use the normalized Laplacian matrix , where is the adjacency matrix with if and otherwise, and is the degree matrix. We then compute the
smallest eigenvectors of
to obtain as its geometric embeddings through the partial eigendecomposition of . Then each node will be assigned an embedding vector where is the ith row of the absolute . Note that since the signs of the eigenvectors are arbitrary, we use the absolute values. Let be the th item of the vector , then it satisfies . Therefore, the node embedding vectors can be viewed as points in a ddimensional unit hypercube. This fact plays an important role in our following sampling strategy.Note that although the standard dense eigensolvers require at least cubic time complexity in the number of graph nodes, with a stateoftheart iterative eigensolver (Stathopoulos and McCombs, 2010; Wu et al., 2017b)
, we can efficiently solve eigendecomposition with complexity that is linear in the number of graph edges. It is also worth noting that the resulting geometric nodes embeddings well capture global properties of the graph since the eigenvectors associated with low eigenvalues of
encode the information about the overall structure of based on the spectral graph theory (Von Luxburg, 2007).In the traditional model of Natural Language Processing, a bagofwords had been the most common way to represent a document. With modern deep learning approaches, each element such as a word in the document or a character in the string is embedded into a lowdimensional vector and is fed within a bagofvectors into recurrent neural networks that perform document and string classification. Similarly, we also represent each graph as bagofvectors using a set of geometric node embeddings. However, although there is canonical ordering for the nodes of a graph, it is not reliable in most case. Therefore, it is important to find an optimal matching between two sets of node embeddings when comparing two graphs.
3.2. Node Transportation via EMD
Now we assume that a graph is represented by the bagofvectors . To use the bagofwords model, we also need to compute weights associated with each node vector. To be precise, if node has outgoing edges, we use as a normalized bagofwords (nBOW) weight for each node. Our goal is to measure the similarity between a pair of graphs using a proper distance measure. Instead of treating it as an assignment problem solved by maximum weight matching as in (Johansson and Dubhashi, 2015), we cast the task as a wellknown transportation problem (Hitchcock, 1941), which can be addressed by using the Earth Mover’s Distance (Rubner et al., 2000).
Using EMD, one can easily measure the dissimilarity between a pair of graphs through node transportation, which essentially takes into account alignments between nodes. Let denote the maximum number of nodes in a pair of graphs . Since is the nBOW weight vector for the graph , it is easy to obtain that . Similarly, we have . Then the EMD is defined as
(1)  
where is the transportation flow matrix with denoting how much of node in travels to node in , and is the transportation cost matrix where each item denotes the distance between two nodes measured in their embedding space. Typically, the Euclidean distance is adopted. We note that with the distance is a metric in the embedding space, the EMD (1) also define a metric between two graphs (Rubner et al., 2000). An attractive attribute of the EMD is that it provides an accurate measurement of the distance between graphs with different nodes that are contextually similar but in different positions in the graph. The EMD distance has been observed to perform well on text categorization (Kusner et al., 2015) and graph classification (Nikolentzos et al., 2017). A straightforward way that defines a kernel matrix based on EMD that measures the similarity between graphs has been shown in (Nikolentzos et al., 2017) as follows:
(2) 
where is the centering matrix and is the EMD distance matrix from all the pairs of graphs. However, there are three problems. The first one is that the Kernel matrix in (2) is not necessarily positivedefinite. The second problem is that the EMD is expensive to compute, since its time complexity is . In addition, computing the EMD for each pair of graphs requires the quadratic time complexity in the number of graphs, which is highly undesirable for largescale graph data. In this paper, we propose a scalable global alignment graph kernel using the random features to simultaneously address all these issues.
4. Scalable Global Alignment Graph Kernel Using Random Features
In this section, we first show how to construct a class of the p.d. global alignment graph kernels from an optimal transportation distance (e.g., EMD) and then present a simple yet scalable way to compute expressive graph embeddings through the RF approximation. We also show that the inner product of the resulting graph embeddings uniformly converge to the exact kernel.
4.1. Global Alignment Graph Kernel Using EMD and RF
The core task is to build a positivedefinite graph kernel that can make full use of both computed geometric node embeddings for graphs and a distance measure considering the alignment of the node embeddings. We here define our global graph kernel as follows:
(3)  
Here is a random graph consisting of random nodes with their associated node embeddings , where each random node embedding is sampled from a ddimensional vector space . Thus, is a distribution over the space of all random graphs of variable sizes . Then we can derive an infinitedimensional feature map from the EMD between and all possible random graphs . One explanation of how our proposed kernel works is that a small random graph can implicitly partition a larger raw graph through node transportation (or node alignments) in the corresponding node embedding space using EMD, as illustrated in Fig. 1.
A more formal and revealing way to interpret our kernel defined in (3) is to express it as
(4)  
where,
(5) 
can be treated as the soft minimum function defined by two parameters and . Since the usual soft minimum is defined as , then Equation (5) can be regarded as its smoothed version, which uses parameter to control the degree of smoothness and is reweighted by a probability density . Interestingly, the value of (5) is mostly determined by the minimum of , when is Lipschitzcontinuous and is large. Since EMD is a metric as discussed above, we have
by the triangle inequality. The equality holds if the maximum size of the random graph, , is equal or greater than the original graph size . Therefore, the kernel value in (4) serves as a good approximation to the EMD between any pair of graphs and . By the kernel definition, it must be positivedefinite.
4.2. Random Graph Embedding: Random Features of Global Alignment Graph Kernel
In this section, we will introduce how to efficiently compute the proposed global alignment graph kernels and derive the random graph embedding that can be used for representing graphlevel embedding from the geometric node embeddings.
4.2.1. Efficient Computation of RGE
Exact computation of the proposed kernel in (3) is often infeasible, as it does not admit a simple analytic solution. A natural way to compute such kernel is to resort to a kernel approximation that is easy to compute while uniformly converges to the exact kernel. As one of the most effective kernel approximation techniques, random features method has been demonstrated great successes in approximating Gaussian Kernel (Rahimi and Recht, 2008; Le et al., 2013) and Laplacian Kernel (Wu et al., 2016) in various applications. However, as we discussed before in Sec. 2.2, conventional RF methods cannot be directly applied to our graph kernels since they are not shiftinvariant and cannot deal with the inputs that are not vectorform. Moreover, for traditional RF methods, we have to know the kernel function prior before hand, which is also not available in our case. However, fortunately, since we can define our kernel in terms of a randomized feature map, it naturally yields the following random approximation that does not require aforementioned assumptions,
(6)  
where is a dimensional vector with the th term , and are i.i.d. samples drawn from . Note that the vector just can be considered as the representation (embedding) of graph . We call this random approximation ”random graph embedding (RGE)”, a generalized concept of ”random features” for our graph inputs. We will also show that this random approximation RGE admits the uniform convergence to the original kernel (3) over all pairs of graphs .
Algorithm 1 summarizes the procedure to generate feature vectors for data graphs. There are several comments to make here. First of all, the distribution is the key to generating highquality node embeddings for random graphs. We propose two different ways to generate random graphs, which we will illustrate in detail later. Second, the size
of the random graphs is typically quite small. An intuitive explanation why a small random graph captures important global information of raw graphs has been discussed in the previous section. However, since there is no prior information to determine how many random nodes is needed to segment the data graph for learning discriminatory features, we sample the size of the random graphs from a uniform distribution
to obtain an unbiased estimate of
. Finally, both node embedding and distance measures can be further improved by exploiting the latest advancements in these techniques.By efficiently approximating the proposed global alignment graph kernel using RGE, we obtain the benefits of both improved accuracy and reduced computational complexity. Recall that the computation of EMD has time complexity and thus the existing graph kernels require at least computational complexity and memory consumption, where and are the number of graphs and the average size of graphs, respectively. Because of the small size of random graphs, the computation of EMD in our RGE approximation only requires (Bourgeois and Lassalle, 1971). It means that our RGE approximation only requires computation with the quasilinear complexity if we treat as a constant (or a small number). Note that with a stateoftheart eigensolver (Stathopoulos and McCombs, 2010; Wu et al., 2017b), we can effectively compute the largest eigenvectors with linear complexity , where is the number of graph edges and is the number, typically quite small, of iterations of iterative eigensolver. Therefore, the total computational complexity and memory consumption of RGE are and respectively. Compared to other graph kernels, our method reduces computational complexity from quadratic to linear in terms of the number of graphs, and from (quasi)cubic to (quasi)linear in terms of the graph size. We will empirically assess the computational runtime in the subsequent experimental section.
4.2.2. Dataindependent and Datadependent Distributions
Algorithm 2 details the two sampling strategies (dataindependent and datadependent distributions) for generating a set of node embeddings of a random graph. The first scheme is to produce random graphs from a dataindependent distribution. Traditionally, conventional RF approximation has to obtain random samples from a distribution corresponding to the user predefined kernel (e.g., Gaussian or Laplacian kernels). However, since we reverse the order by firstly defining the distribution and then defining a kernel similar to (Wu et al., 2018b), we are free to select any distribution that can capture the characteristics of the graph data well. Given that all node embeddings are distributed in a dimensional unit hypercube space, we first compute the largest and smallest elements in all node embeddings and then use a uniform distribution in the range of these two values to generate a set of dimensional vectors for random node embeddings in a random graph. Since node embeddings are roughly dispersed uniformly in the dimensional unit hypercube space, we found this scheme works well in most of cases. Like the traditional RF, this sampling scheme is dataindependent. So we call it RGE(RF).
Another scheme is conceptually similar to recently proposed work on deriving datadependent traditional random features (Ionescu et al., 2017) for vectorinputs, which have been shown to have a lower generalization error than dataindependent random features (Rudi and Rosasco, 2017). However, unlike these conventional RF methods (Ionescu et al., 2017; Rudi and Rosasco, 2017) and the conventional landmarks method that selects a representative set of whole graphs (Johansson and Dubhashi, 2015), we propose a new way to sample parts of graphs (only from training data) as random graphs, which we refer to as the Anchor SubGraphs (ASG) scheme RGE(ASG). There are several potential advantages compared to lankmarks and RF methods. First of all, ASG opens the door to defining an indefinite feature space since there are conceptually unlimited numbers of subgraphs, compared to the limited size (up to the number of graphs) of landmarks. Second, ASG produces a random graph by permuting graph nodes of the original graph and by resembling randomly their corresponding node embeddings in the node embedding space, which may help to identify more hidden global structural information instead of only considering the raw graph topology. Thanks to EMD, hidden global structure can be captured well through node alignments. Finally, unlike RGE(RF), the ASG scheme allows exploiting nodelabel information in raw graphs since this information is also accessible through the sampled nodes in subgraphs.
Incorporating the node label information into RGE(ASG) is fairly straightforward; it is desirable to assign nodes with same labels a smaller distance than these with different labels. Therefore, we can simply set the distance if nodes and have different node labels since is the largest distance in a dimensional unit hypercube space.
4.3. Convergence of Random Graph Embedding
In this section, we establish a bound on the number of random graphs required to guarantee an approximation between the exact kernel (3) and its random feature approximation (6) denoted by . We first establish a covering number for the space under the EMD metric.
Lemma 0 ().
There is an covering of under the metric defined by EMD with Euclidean ground distance such that
with , where is an upper bound on the number of nodes for any graph .
Proposition 0 ().
Let . We have that if , , where is an covering of , and is the parameter of , then , .
Thus, given random graphs, the inner product of RGE can uniformly approximates the corresponding exact graph kernel within precision, with high probability, as shown in the following Theorem.
Theorem 3 ().
The uniform convergence rate is
Therefore, to guarantee with probability at least , it suffices to have
Proof.
5. Experiments
We performed experiments to demonstrate the effectiveness and efficiency of the proposed method, and compared against a total of twelve graph kernels and deep graph neural networks on nine benchmark datasets^{3}^{3}3http://members.cbio.minesparistech.fr/ nshervashidze/code/ widely used for testing the performance of graph kernels. We implemented our method in Matlab and utilized the CMEX function^{4}^{4}4http://ai.stanford.edu/rubner/emd/default.htm for the computationally expensive component of EMD. All computations were carried out on a DELL system with Intel Xeon processors 272 at 2.93GHz for a total of 16 cores and 250 GB of memory, running the SUSE Linux operating system.
Datasets. We applied our method to widelyused graph classification benchmarks from multiple domains (Shervashidze et al., 2011; Vishwanathan et al., 2010; Yanardag and Vishwanathan, 2015b); MUTAG, PTCMR, ENZYMES, PROTEINS, NCI1, and NCI109 are graphs derived from small molecules and macromolecules, and IMDBB, IMDBM, and COLLAB are derived from social networks. All datasets have binary labels except ENZYMES, IMDBM, and COLLAB which have 6, 3 and 3 classes, respectively. All bioinformatics graph datasets have node labels while all other social network graphs have no node labels. Detailed descriptions of these 9 datasets, including statistical properties, are provided in the Appendix.
Baselines. Due to the large literature, we compare our method RGE against five representative global kernels related to our approach and three classical graph kernels, including EMDbased Indefinite Kernel (EMD) (Nikolentzos et al., 2017), Pyramid Match Kernel (PM) (Nikolentzos et al., 2017), Lovász Kernel (Lo) (Johansson et al., 2014), Optimal Assignment Matching (OA(A)) (Johansson and Dubhashi, 2015), Vertex Optimal Assignment Kernel (VOA) (Kriege et al., 2016), Random Walk Kernel (RW) (Gärtner et al., 2003), Graphlet Kernel (GL) (Shervashidze et al., 2009), and Shortest Path Kernel (SP) (Borgwardt and Kriegel, 2005). Furthermore, we also compare RGE with several variants of WeisfelerLeman Graph Kernel (WLST (Shervashidze et al., 2011), WLSP (Shervashidze et al., 2011), and WLOA(A) (Johansson and Dubhashi, 2015)
). Finally, we compare RGE against four recently developed deep learning models with node labels, including Deep Graph Convolutional Neural Networks (DGCNN),
(Wu et al., 2017a); PATCHYSAN (PSCN) (Niepert et al., 2016), Diffusion CNN (DCNN) (Atwood et al., 2016), and Deep Graphlet Kernel (DGK) (Yanardag and Vishwanathan, 2015b). The first three models are built on convolutional neural networks on graphs while the last one is based on Word2Vec model. Since WL test is a generic technique to utilize discrete node labels for improving many standalone graph kernels, in this study, we first focus on testing the capability of each graph kernel without node labels and then assess the performance of each graph kernel with plain node labels and with WL techniques ^{5}^{5}5Our approach to combine RGE(ASG) with WL techniques is to first use WL to generate new node labels and then apply RGE(ASG) with these node labels..Setup. Since RGE is a graph embedding, we directly employ a linear SVM implemented in LIBLINEAR (Fan et al., 2008)
since it can faithfully separate the effectiveness of our feature representation from the power of the nonlinear learning solvers. Following the convention of the graph kernel literature, we perform 10fold crossvalidation, using 9 folds for training and 1 for testing, and repeat the whole experiments ten times (thus 100 runs per dataset) and report the average prediction accuracies and standard deviations. The ranges of hyperparameters
and are [1e3 1e2 1e1 1 10] and [3:3:30], respectively. All parameters of the SVM and hyperparameters of our method were optimized only on the training dataset. To eliminate the random effects, we repeat the whole experiments ten times and report the average prediction accuracies and standard deviations. For all baselines we have taken the best reported number from their papers. Since EMD is the closet method to ours, we execute both methods under the same setting for fair comparison and report both accuracy and computational time.Impacts of on Accuracy and Runtime of RGE. We conducted experiments investigating the convergence behavior and the scalability of three variants of RGE with or without using node labels when increasing the number of random graphs. The hyperparameter is obtained from the previous crossvalidations on the training set. We report both testing accuracy and runtime when increasing graph embedding size . As shown in Fig. 2, all variants of RGE converge very rapidly when increasing R from a small number (R = 4) to relatively large number (). This confirms our analysis in Theorem 3 that the RGE approximation can guarantee rapid convergence to the exact kernel. The second observation is that RGE exhibits quasilinear scalability with respect to , as predicted by our computational analysis. This is particularly important for large scale graph data since most graph kernels have quadratic complexity in the number of graphs and/or in the size of graphs.
Datasets  MUTAG  PTCMR  ENZYMES  NCI1  NCI019 

RGE(RF)  86.33 1.39(1s)  59.82 1.42(1s)  35.98 0.89(38s)  74.70 0.56(727s)  72.50 0.32(865s) 
RGE(ASG)  85.56 0.91(2s)  59.97 1.65 (1s)  38.52 0.91(18s)  74.30 0.45(579s)  72.70 0.42(572s) 
EMD  84.66 2.69 (7s)  57.65 0.59 (46s)  35.45 0.93 (216s)  72.65 0.34 (8359s)  70.84 0.18 (8281s) 
PM  83.83 2.86  59.41 0.68  28.17 0.37  69.73 0.11  68.37 0.14 
Lo  82.58 0.79  55.21 0.72  26.5 0.54  62.28 0.34  62.52 0.29 
OA(A)  79.89 0.98  56.77 0.85  36.12 0.81  67.99 0.28  67.14 0.26 
RW  77.78 0.98  56.18 1.12  20.17 0.83  56.89 0.34  56.13 0.31 
GL  66.11 1.31  57.05 0.83  18.16 0.47  47.37 0.15  48.39 0.18 
SP  82.22 1.14  56.18 0.56  28.17 0.64  62.02 0.17  61.41 0.32 
Datasets  PTCMR  ENZYMES  PROTEINS  NCI1  NCI019 
RGE(ASG)  61.5 2.34(1s)  48.27 0.99(28s)  75.98 0.71(20s)  76.46 0.45(379s)  74.42 0.30(526s) 
EMD  57.67 2.11 (42s)  42.85 0.72 (296s)  76.03 0.28 (1936s)  75.89 0.16 (7942s)  73.63 0.33 (8073s) 
PM  60.38 0.86  40.33 0.34  74.39 0.45  72.91 0.53  71.97 0.15 
OA(A)  58.76 0.92  43.56 0.66  —  69.83 0.30  68.96 0.35 
VOA  56.4 1.8  35.1 1.1  73.8 0.5  65.6 0.4  65.1 0.4 
RW  57.06 0.86  19.33 0.62  71.67 0.78  63.34 0.27  63.51 0.18 
GL  59.41 0.94  32.70 1.20  71.63 0.33  66.00 0.07  66.59 0.08 
SP  60.00 0.72  41.68 1.79  73.32 0.45  73.47 0.11  73.07 0.11 
WLRGE(ASG)  62.20 1.67(1s)  57.97 1.16(38s)  76.63 0.82(30s)  85.85 0.42(401s)  85.32 0.29(798s) 
WLST  57.64 0.68  52.22 0.71  72.92 0.67  82.19 0.18  82.46 0.24 
WLSP  56.76 0.78  59.05 1.05  74.49 0.74  84.55 0.36  83.53 0.30 
WLOA(A)  59.72 1.10  53.76 0.82  —  84.75 0.21  84.23 0.19 
Datasets  PTCMR  PROTEINS  NCI1  IMDBB  IMDBM  COLLAB 

(WL)RGE(ASG)  62.20 1.67  76.63 0.82  85.85 0.42  71.48 1.01  47.26 0.89  76.85 0.34 
DGCNN  58.59 2.47  75.54 0.94  74.44 0.47  70.03 0.86  47.83 0.85  73.76 0.49 
PSCN  62.30 5.70  75.00 2.51  76.34 1.68  71.00 2.29  45.23 2.84  72.60 2.15 
DCNN  56.6 1.20  61.29 1.60  56.61 1.04  49.06 1.37  33.49 1.42  52.11 0.53 
DGK  57.32 1.13  71.68 0.50  62.48 0.25  66.96 0.56  44.55 0.52  73.09 0.25 
Scalability of RGE varying graphs and nodes. We further assess the scalability of RGE when varying number of graphs and size of graph for randomly generated graphs. We change the number of graphs in the range of and the size of graph in the range of , respectively. When generating random adjacency matrices, we set the number of edges always be twice the number of nodes in a graph. We report the runtime for computing node embeddings using a stateoftheart eigensolver (Wu et al., 2017b), generating RGE graph embeddings, and the overall computation of graph classification, accordingly. Fig. 3(a) shows the linear scalability of RGE when increasing the number of graphs, confirming our complexity analysis in the previous Section. In addition, as shown in Fig. 3(b), RGE still exhibits linear scalablity in computing eigenvectors but slightly quasilinear scalablity in RGE generation time and overall time, when increasing the size of graph. This is because that even though RGE reduces conventional EMD’s complexity from supercubic to (where D is a small constant), the log factor starts to show its impact on computing EMD between raw graphs and small random graphs when becomes large (e.g. close to 1000). Interestingly, with a stateoftheart eigensolver, the complexity of computing a few eigenvectors is linearly proportional to the graph size (Wu et al., 2017b). This is highly desired property of our RGE embeddings, which open the door to largescale applications of graph kernels for various applications such as social networks analysis and computational biology.
Comparison with All Baselines. Tables 1, 2, and 3 show that RGE consistently outperforms or matches other stateoftheart graph kernels and deep learning approaches in terms of classification accuracy. There are several further observations worth making here. First, EMD, the closest method to RGE, shows good performance compared to most of other methods but often has significantly worse performance than RGE, highlighting the utility the novel graph kernel design using a feature map of random graphs and the effectiveness of a truly p.d. kernel. Importantly, RGE is also orders of magnitude faster than EMD in all cases, especially for data with a large graph size (like PROTEINS) or large number of graphs (like NCI1 and NCI109).
Second, the performance of RGE renders clear the importance of considering global properties graphs, and of having a distance measure able to align contextuallysimilar but positionallydifferent nodes, for learning expressive representations of graphs. In addition, as shown in Table 2, we observe that all methods (including RGE) gain performance benefits when considering the node label information or utilizing WL iterations based on node labels. With node label information, the gaps between RGE and other methods diminish but still showing very clear advantages of RGE.
Finally, as shown in Table 3, for biological datasets we used the WLRGE(ASG) to obtain the best performance with WL iteration. For social network datasets, we used the RGE(ASG) without node label since there are no node labels on these datasets. Compared to supervised deep learning based approaches, our unsupervised RGE method yet still shows clear advantages, highlighting the importance of aligning the structural roles of each node when comparing two graphs. In contrast, most of deep learning based methods focus on nodelevel representations instead of graphlevel representation (typically using meanpooling), which cannot take into account these important structural roles of each node in graphs.
6. Conclusion and Future Work
In this work, we have presented a new family of p.d. and scalable global graph kernels that take into account global properties of graphs. The benefits of RGE are demonstrated by its much higher graph classification accuracy compared with other graph kernels and its (quasi)linear scalability in terms of the number of graphs and graph size. Several interesting directions for future work are indicated: i) the graph embeddings generated by our technique can be applied and generalized to other learning problems such as graph (subgraph) matching or searching; ii) extensions of the RGE kernel for graphs with continuous node attributes and edge attributes should be explored.
References
 DDGK: learning graph representations for deep divergence graph kernels. arXiv:1904.09671. Cited by: footnote 1.
 Sparse diffusionconvolutional neural networks. In NIPS, Cited by: §5.

On the equivalence between kernel quadrature rules and random feature expansions.
Journal of Machine Learning Research
18 (21), pp. 1–38. Cited by: §2.2.  Shortestpath kernels on graphs. In Data Mining, Fifth IEEE International Conference on, pp. 8–pp. Cited by: §2.1, §5.
 An extension of the munkres algorithm for the assignment problem to rectangular matrices. Communications of the ACM 14 (12), pp. 802–804. Cited by: §4.2.1.
 Revisiting spectral graph clustering with generative community models. In ICDM, pp. 51–60. Cited by: §1.
 LIBLINEAR: a library for large linear classification. Journal of machine learning research 9 (Aug), pp. 1871–1874. Cited by: §B.3, §5.
 Characterizing structural relationships in scenes using graph kernels. ACM Transactions on Graphics (TOG) 30 (4), pp. 34. Cited by: §1.
 On graph kernels: hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pp. 129–143. Cited by: §1, §2.1, §5.
 Convolution kernels on discrete structures. Technical report Department of Computer Science, University of California at Santa Cruz. Cited by: §1, §2.1.
 The distribution of a product from several sources to numerous localities. Studies in Applied Mathematics 20 (14), pp. 224–230. Cited by: §3.2.
 Cyclic pattern kernels for predictive graph mining. In KDD, pp. 158–167. Cited by: §2.1.
 Largescale datadependent kernel approximation. In Artificial Intelligence and Statistics, pp. 19–27. Cited by: §4.2.2.
 Learning with similarity functions on graphs using matchings of geometric embeddings. In KDD, pp. 467–476. Cited by: §B.3, §1, §1, §2.1, §3.2, §4.2.2, §5.
 Global graph kernels using geometric embeddings. In ICML, Cited by: §1, §2.1, §5.
 The multiscale laplacian graph kernel. In NIPS, pp. 2990–2998. Cited by: §2.1.
 On valid optimal assignment kernels and applications to graph classification. In NIPS, pp. 1623–1631. Cited by: §2.1, §5.
 From word embeddings to document distances. In ICML, pp. 957–966. Cited by: §3.2.
 Fastfoodapproximating kernel expansions in loglinear time. In ICML, Vol. 85. Cited by: §4.2.1.
 On the shannon capacity of a graph. IEEE Transactions on Information theory 25 (1), pp. 1–7. Cited by: §2.1.
 Decoding brain states using backward edge elimination and graph kernels in fmri connectivity networks. Journal of neuroscience methods 212 (2), pp. 259–268. Cited by: §1.
 Learning convolutional neural networks for graphs. In ICML, pp. 2014–2023. Cited by: §5.
 Matching node embeddings for graph similarity.. In AAAI, pp. 2429–2435. Cited by: §1, §2.1, §3.2, §5.
 Random features for largescale kernel machines. In NIPS, pp. 1177–1184. Cited by: §2.2, §2, §4.2.1.
 Graph kernels for chemical informatics. Neural networks 18 (8), pp. 1093–1110. Cited by: §1.

The earth mover’s distance as a metric for image retrieval
.International journal of computer vision
40 (2), pp. 99–121. Cited by: §3.2, §3.2.  Generalization properties of learning with random features. In NIPS, pp. 3218–3228. Cited by: §4.2.2.
 Fast subtree kernels on graphs. In NIPS, pp. 1660–1668. Cited by: §1, §2.1.
 Weisfeilerlehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §1, §2.1, §5, §5.
 Efficient graphlet kernels for large graph comparison. In AIStats, pp. 488–495. Cited by: §2.1, §5.
 Learning kernels with random features. In NIPS, pp. 1298–1306. Cited by: §2.2.
 Continuousflow graph transportation distances. arXiv:1603.06927. Cited by: §1.
 PRIMME: preconditioned iterative multimethod eigensolver—methods and software description. ACM Transactions on Mathematical Software (TOMS) 37 (2), pp. 21. Cited by: §B.2, §3.1, §4.2.1.
 Graph kernels. Journal of Machine Learning Research 11, pp. 1201–1242. Cited by: §5.

A tutorial on spectral clustering
. Statistics and computing 17 (4), pp. 395–416. Cited by: §3.1. 
Malware analysis with graph kernels and support vector machines
. In Malicious and Unwanted Software, 2009 4th International Conference on, pp. 63–68. Cited by: §1.  Directed acyclic graph kernels for action recognition. In ICCV, pp. 3168–3175. Cited by: §1.

DGCNN: disordered graph convolutional neural network based on the gaussian mixture model
. arXiv:1712.03563. Cited by: §5.  Scalable spectral clustering using random binning features. In KDD, pp. 2506–2515. Cited by: §2.2.
 PRIMME_SVDS: a highperformance preconditioned svd solver for accurate largescale computations. SIAM Journal on Scientific Computing 39 (5), pp. S248–S271. Cited by: §B.2, §3.1, §4.2.1, §5.
 Revisiting random binning features: fast convergence and strong parallelizability. In KDD, pp. 1265–1274. Cited by: §4.2.1.
 D2KE: from distance to kernel and embedding. arXiv preprint arXiv:1802.04956. Cited by: §1, §2.2, §4.2.2.
 Random warping series: a random features method for timeseries embedding. In International Conference on Artificial Intelligence and Statistics, pp. 793–802. Cited by: §2.2.
 A structural smoothing framework for robust graph comparison. In NIPS, pp. 2134–2142. Cited by: §2.1.
 Deep graph kernels. In KDD, pp. 1365–1374. Cited by: §2.1, §5, §5.
 RetGK: graph kernels based on return probabilities of random walks. In NIPS, pp. 3968–3978. Cited by: §1.
Appendix A Appendix A: Proofs of Lemma 1 and Theorem 3
a.1. Proof of Lemma 1
Proof.
Since the geometric node embedding uses the normalized eigenvectors of the Laplacian matrix, we have that , i.e., belongs to a unit ball. Therefore, we can find an covering of size for the unit ball. Next, we define as all the possible sets of of size no larger than . So we have . For any graph , we can find with also nodes such that . Then by the definition of EMD (1), a solution that assigns each node in to a node in would have overall cost less than , So . ∎
a.2. Proof of Proposition 2
a.3. Proof of Theorem 3
Appendix B Appendix B: Additional Experimental Results
Dataset  MUTAG  PTC  ENZYMES  PROTEINS  NCI1  NCI109  IMDBB  IMDBM  COLLAB 

Max # Nodes  28  109  126  620  111  111  136  89  492 
Min # Nodes  10  2  2  4  3  4  12  7  32 
Ave # Nodes  17.9  25.6  32.6  39.05  29.9  29.7  19.77  13.0  74.49 
Max # Edges  33  108  149  1049  119  119  1249  1467  40119 
Min # Edges  10  1  1  5  2  3  26  12  60 
Ave # Edges  19.8  26.0  62.1  72.81  32.3  32.1  96.53  65.93  2457.34 
# Graph  188  344  600  1113  4110  4127  1000  1500  5000 
# Graph Labels  2  2  6  2  2  2  2  3  3 
# Node Labels  7  19  3  3  37  38  —  —  — 
General Setup. We perform experiments to demonstrate the effectiveness and efficiency of the proposed method, and compare against total 12 graph kernels and deep graph neural networks on 9 benchmark datasets (as shown in Table 4) ^{6}^{6}6http://members.cbio.minesparistech.fr/ nshervashidze/code/ that is widely used for testing the performance of graph kernels. We implement our method in Matlab and utilize CMEX function ^{7}^{7}7http://ai.stanford.edu/ rubner/emd/default.htm for the computationally expensive component of EMD. To accelerate the computation, we use multithreading with total 12 threads in all experiments. All computations were carried out on a DELL dual socket system with Intel Xeon processors 272 at 2.93GHz for a total of 16 cores and 250 GB of memory, running the SUSE Linux operating system.
b.1. Additional Results and Discussions on Accuracy and Runtime of RGE Varying
Setup. We now conduct experiments to investigate the behavior of three variants of RGE with or without using node labels by varying the number of random graphs. The hyperparameter is obtained from the previous crossvalidations on the training set. Depending on the size of graph on each dataset, we set in the range starting from 4 and ending with a number just satisfying . We report both testing accuracy and runtime when increasing graph embedding size .
b.2. Additional Results and Discussions on Scalability of RGE varying graphs and nodes
Setup. We assess the scalability of RGE when varying number of graphs and the size of a graph on randomly generated graphs. We change the number of graphs in the range of and the size of graph in the range of , respectively. When generating random adjacency matrices, we set the number of edges always be twice the number of nodes in a graph. We use the size of node embedding just like in the previous sections. We set the hyperparameters related to RGE itself are and . We report the runtime for computing node embeddings using stateoftheart eigensolver (Stathopoulos and McCombs, 2010; Wu et al., 2017b) and RGE graph embeddings, and the overall runtime, respectively.
b.3. Additional Results and Discussions on Comparisons Against All Baselines
Setup. Since RGE is a graph embedding, we directly employ a linear SVM implemented in LIBLIBNEAR (Fan et al., 2008) since it can faithfully examine the effectiveness of our feature representation from the power of the nonlinear learning solvers. Following the convention in the graph kernel literature, we perform 10fold crossvalidation, using 9 folds for training and 1 for testing, and repeat the whole experiments ten times (thus 100 runs per dataset) and report the average prediction accuracies and standard deviations. The ranges of hyperparameters and are [1e3 1e2 1e1 1 10] and [3:3:30], respectively. All parameters of the SVM and hyperparameters of our method were optimized only on the training dataset. The node embedding size is set to either 4, 6 or 8 but always be the same number for all variants of RGE on the same datasets. To eliminate the random effects, we repeat the whole experiments ten times and report the average prediction accuracies and standard deviations. For all baselines we take the best number reported in the papers except EMD, where we rerun the experiments for fair comparisons in terms of both accuracy and runtime. Since GRE, OA, EMD, and PM are essentially built on the same node embeddings from the adjacency matrices, we take the number of OA(A) in (Johansson and Dubhashi, 2015) for a fair comparison.