1 Introduction
Graph-structured data have become ubiquitous across domains over the last decades, with examples ranging from social and sensor networks to chemo- and bioinformatics. Graph kernels (Vishwanathan et al., 2010) have been very successful in dealing with the complexity of graphs and have shown good predictive performance on a variety of classification problems (Shervashidze et al., 2011; Morris et al., 2016; Yanardag and Vishwanathan, 2015). Most graph kernels rely on the Convolution framework (Haussler, 1999), which decomposes structured objects into substructures to compute local similarities that are then aggregated. While generally successful in several applications, Convolution kernels on graphs are not without limitations: (1) the simple way in which the similarities between substructures are aggregated may limit their ability to capture complex non-linear structural characteristics of the graph; (2) most proposed variants do not generalise to graphs with high-dimensional continuous node attributes, and extensions are far from straightforward. Various solutions have been proposed to address point (1). For example, Fröhlich et al. (2005) introduced kernels based on the optimal assignment of node labels for molecular graphs, even though these kernels are not positive definite (Vert, 2008). More recently, Kriege et al. (2016) proposed another approach, which set a new state-of-the-art performance in graph classification on categorical node labels. However, this formulation cannot handle continuous node attributes, leaving point (2) an open problem.
To overcome both limitations, we propose a method that combines the most successful vectorial graph representations derived from the graph kernel literature with ideas from optimal transport theory, which have recently gained considerable attention. In particular, improved computational strategies for efficiently obtaining Wasserstein distances (Cuturi, 2013; Altschuler et al., 2017) have led to many applications in machine learning, ranging from generative models (Arjovsky et al., 2017) to new loss functions (Frogner et al., 2015). For applications to graphs, notions from optimal transport were used to tackle the graph alignment problem (Xu et al., 2019). In this work, we provide the theoretical foundations of our method, then define a new graph kernel formulation and finally present successful experimental results. More precisely, our main contributions can be summarised as follows:
We present the graph Wasserstein distance, a new distance between graphs based on their node feature representations and we discuss how kernels can be derived from it;

We introduce a Weisfeiler–Lehman inspired embedding scheme that works for both categorically labelled and continuously attributed graphs and couple it with our graph Wasserstein distance;

We establish a new state of the art for graph kernels on traditional graph classification benchmarks with continuous attributes.
2 Background: Graph Kernels and Wasserstein Distance
In this section, we introduce the notation that will be used throughout the manuscript and provide the necessary background on graph kernel methods and on the Wasserstein distance.
2.1 Graph kernels
Kernels are a class of similarity functions that present attractive properties to be used in learning algorithms (Schölkopf and Smola, 2002). Let $\mathcal{X}$ be a set and $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a function associated with a Hilbert space $\mathcal{H}$, such that there exists a map $\phi: \mathcal{X} \to \mathcal{H}$ with $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$. Then, $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) and $k$
is said to be a positive definite kernel. A positive definite kernel can be interpreted as a dot product in a high-dimensional space, thus permitting its use in any learning algorithm that relies on dot products, such as Support Vector Machines (SVMs), by virtue of the
kernel trick (Schölkopf, 2001). Because ensuring positive definiteness is not always feasible, many learning algorithms were recently proposed to extend SVMs to indefinite kernels (Ong et al., 2004; Balcan et al., 2008; Loosli et al., 2015; Oglic and Gärtner, 2018).

We define a graph as a tuple $G = (V, E)$, where $V$ and $E$ denote the sets of nodes and edges, respectively; we further assume that the edges are undirected. Moreover, we denote the number of nodes and edges of $G$ as $n_G = |V|$ and $m_G = |E|$. For a node $v \in V$, we write $\mathcal{N}(v)$ to denote its first-order neighbourhood and $\deg(v) = |\mathcal{N}(v)|$ its degree. We say that a graph is labelled if its nodes have categorical labels. A label on the nodes is a function $\ell: V \to \Sigma$ that assigns to each node in $V$ a value from a finite label alphabet $\Sigma$. Additionally, we say that a graph is attributed if for each node $v \in V$ there exists an associated vector $a(v) \in \mathbb{R}^m$. In the following, we refer to $a(v)$ as the node attributes and to $\ell(v)$ as the categorical node labels of node $v$. In particular, the node attributes are high-dimensional continuous vectors, while the categorical node labels are assumed to be integer numbers, encoding either an ordered discrete value or a category. With the term node labels, we will implicitly refer to categorical node labels. Finally, a graph can have weighted edges, and the function $w: E \to \mathbb{R}$ defines the weight of an edge $e = \{v, u\} \in E$.
Kernels on graphs are generally defined using the Convolution framework of Haussler (1999). The main idea is to decompose the graph into substructures and to define the kernel value as a combination of substructure similarities. For instance, the shortest-path kernel (Borgwardt and Kriegel, 2005) computes each kernel value as a sum of the similarities between each shortest path in $G$ and each shortest path in $G'$. Despite the practical success of Convolution kernels, they often rely on aggregation strategies that ignore valuable information, such as the distribution of the substructures. An example is the Weisfeiler–Lehman (WL) subtree kernel (Shervashidze and Borgwardt, 2009; Shervashidze et al., 2011), which generates graph-level features by summing the contributions of the node representations. To avoid these simplifications, we employ concepts from optimal transport theory, such as the Wasserstein distance, which can better capture the similarities between graphs.
2.2 Wasserstein distance
The Wasserstein distance is a distance function between probability distributions defined on a given metric space. Let $\mu$ and $\nu$ be two probability distributions on a metric space $M$ equipped with a ground distance $d$, such as the Euclidean distance.

Definition 1.
The $L^p$-Wasserstein distance, for $p \in [1, \infty)$, is defined as
$W_p(\mu, \nu) := \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p}$   (1)
where $\Gamma(\mu, \nu)$ is the set of all transportation plans $\gamma$ over $M \times M$ with marginals $\mu$ and $\nu$ on the first and second factors, respectively.
The Wasserstein distance satisfies the axioms of a metric, provided that $d$ is a metric (see the monograph of Villani (2008), Chapter 6, for a proof). Throughout the paper, we focus on the case $p = 1$ and, unless noted otherwise, simply write Wasserstein distance for the $L^1$-Wasserstein distance.
The Wasserstein distance is linked to the optimal transport problem (Villani, 2008), where the aim is to find the most "inexpensive" way, in terms of the ground distance, to transport all the probability mass from the distribution $\mu$ so as to match the distribution $\nu$. An intuitive illustration can be made for the one-dimensional case, where the two probability distributions can be imagined as piles of dirt or sand. The Wasserstein distance, sometimes also referred to as the Earth Mover's Distance (Rubner et al., 2000), can then be interpreted as the minimum effort required to move the content of the first pile to reproduce the second pile.
In this paper, we deal with finite sets of node embeddings rather than with continuous probability distributions. We can therefore reformulate the Wasserstein distance as a sum rather than an integral, and use the matrix notation commonly encountered in the optimal transport literature (Rubner et al., 2000) to represent the transportation plan. Given two sets of vectors $X \in \mathbb{R}^{n \times m}$ and $X' \in \mathbb{R}^{n' \times m}$, we can equivalently define the Wasserstein distance between them as
$W_1(X, X') := \min_{P \in \Gamma(X, X')} \langle P, M \rangle$   (2)
Here, $M$ is the distance matrix containing the distances $d(x_i, x'_j)$ between each element $x_i$ of $X$ and each element $x'_j$ of $X'$, $P \in \Gamma(X, X')$ is a transport matrix (or joint probability), and $\langle \cdot, \cdot \rangle$ is the Frobenius dot product. The transport matrix $P$ contains the fractions that indicate how to transport the values from $X$ to $X'$ with the minimal total transport effort. Since we assume that the total mass to be transported equals $1$ and is evenly distributed across the elements of $X$ and $X'$, the row and column values of $P$ must sum up to $1/n$ and $1/n'$, respectively.
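To make this discrete formulation concrete, the following sketch (our illustration, not the paper's implementation; in practice a dedicated solver such as the POT library cited in the implementation section would be used) casts the problem as a linear programme with the marginal constraints described above:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wasserstein_discrete(X, Y):
    """Solve min_P <P, M> s.t. rows of P sum to 1/n, columns to 1/n' (Eq. 2)."""
    n, m = len(X), len(Y)
    M = cdist(X, Y)  # ground-distance matrix between node embeddings
    # Flatten P row-major and encode the two marginal constraints.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # i-th row of P sums to 1/n
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # j-th column of P sums to 1/m
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun  # the Wasserstein distance W_1(X, Y)
```

For two identical embedding sets the optimal plan is diagonal and the distance is zero.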
3 Wasserstein distance on graphs
The unsatisfactory nature of the aggregation step of current Convolution graph kernels, which may mask important substructure differences by averaging, motivated us to seek a finer distance measure between structures and their components. In parallel, recent advances in optimisation for faster computation of the optimal transport problem inspired us to consider this framework for graph classification. Our method relies on the following steps: (1) transform each graph into a set of node embeddings, (2) measure the Wasserstein distance between each pair of graphs, and (3) compute a similarity matrix to be used in the learning algorithm. Figure 1 illustrates the first two steps, while Algorithm 1 summarises the whole procedure. We start by defining an embedding scheme and illustrate how we integrate the embeddings into the Wasserstein distance.
Definition 2 (Graph Embedding Scheme).
Given a graph $G = (V, E)$, a graph embedding scheme $f: \mathcal{G} \to \mathbb{R}^{n_G \times m}$ is a function that outputs a fixed-size vectorial representation for each node in the graph. For each $v_i \in V$, the $i$-th row of $f(G)$ is called the node embedding of $v_i$.
Note that Definition 2 permits treating node labels, which are categorical attributes, as one-dimensional attributes with $m = 1$.
Definition 3 (Graph Wasserstein Distance).
Given two graphs $G = (V, E)$ and $G' = (V', E')$, a graph embedding scheme $f$ and a ground distance $d$, we define the Graph Wasserstein Distance (GWD) as

$D_W^f(G, G') := W_1\big(f(G), f(G')\big).$   (3)
We will now propose a graph embedding scheme inspired by the WL kernel on categorically labelled graphs, extend it to continuously attributed graphs with weighted edges, and show how to integrate it with the GWD presented in Definition 3.
3.1 Generating node embeddings
The Weisfeiler–Lehman scheme. The Weisfeiler–Lehman subtree kernel (Shervashidze and Borgwardt, 2009; Shervashidze et al., 2011), designed for labelled non-attributed graphs, looks at similarities among subtree patterns, defined by a propagation scheme on the graphs that iteratively compares labels on the nodes and their neighbours. This is achieved by creating a sequence of ordered strings through the aggregation of the labels of a node and its neighbours; those strings are subsequently hashed to create updated, compressed node labels. With increasing iterations of the algorithm, these labels represent increasingly larger neighbourhoods of each node, allowing the comparison of more extended substructures.
More precisely, consider a graph $G = (V, E)$, let $\ell^0(v) = \ell(v)$ be the initial node label of each $v \in V$, and let $H$ be the number of WL iterations. Then, for $h = 1, \dots, H$, we can define a recursive scheme to compute $\ell^h(v)$ by looking at the ordered set of labels of the neighbours $\mathcal{N}(v)$, as

$\ell^h(v) = \operatorname{hash}\big(\ell^{h-1}(v), \mathcal{N}^{h-1}(v)\big),$   (4)

where $\mathcal{N}^{h-1}(v)$ denotes the sorted sequence of labels $\{\ell^{h-1}(u) : u \in \mathcal{N}(v)\}$.
We call this procedure the WL labelling scheme. As in the original publication (Shervashidze and Borgwardt, 2009), we use perfect hashing for the hash function, so that nodes at iteration $h$ have the same label if and only if their labels and those of their neighbours were identical at iteration $h-1$.
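A minimal sketch of this labelling scheme (our illustration; the function name and the adjacency-list input format are assumptions, and a Python dictionary stands in for the perfect hash):

```python
def wl_relabel(adj, labels, n_iter):
    """WL labelling scheme (Eq. 4): repeatedly hash each node's label together
    with the sorted labels of its neighbours. Returns labels for all iterations."""
    history = [list(labels)]
    for _ in range(n_iter):
        cur = history[-1]
        signatures = [(cur[v], tuple(sorted(cur[u] for u in adj[v])))
                      for v in range(len(adj))]
        # Perfect hashing: assign a fresh integer to each distinct signature.
        table = {}
        history.append([table.setdefault(s, len(table)) for s in signatures])
    return history
```

On a path graph with uniform initial labels, a single iteration already separates the endpoints (degree 1) from the middle node (degree 2).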
Extension to continuous attributes. For graphs with continuous attributes $a(v) \in \mathbb{R}^m$, we need to improve the WL refinement step, whose original definition prohibits handling the continuous case. The key idea is to create an explicit propagation scheme that leverages and updates the current node features by averaging over the neighbourhoods. While similar approaches have been implicitly investigated for computing node-level kernel similarities (Neumann et al., 2016; Morris et al., 2016), they rely on additional hashing steps on the continuous features. Moreover, we can easily account for edge weights by considering them in the average calculation of each neighbourhood. Suppose we have a continuous attribute $a^0(v) = a(v)$ for each node $v \in V$; then we recursively define

$a^{h+1}(v) = \frac{1}{2}\left( a^h(v) + \frac{1}{\deg(v)} \sum_{u \in \mathcal{N}(v)} w\big(\{v, u\}\big) \, a^h(u) \right).$   (5)
When edge weights are not available, we set $w \equiv 1$. We consider the weighted average of the neighbourhood attribute values instead of a sum, and add the factor $1/2$, because we want to ensure a similar scale of the features across iterations; in fact, we concatenate such features when building our proposed kernel (see Definition 4 for more details) and observe better empirical results with similarly scaled features. While not constituting a test of graph isomorphism, this refinement step can be seen as an intuitive extension for continuous attributes of the one used by the WL subtree kernel on categorical node labels, a widely successful baseline. Moreover, it resembles the propagation scheme used in many graph neural networks, which have proven to be very successful for node classification on large data sets (Duvenaud et al., 2015; Kipf and Welling, 2017; Klicpera et al., 2019). Finally, its ability to account for edge weights makes it applicable to all types of graphs without having to perform a hashing step (Morris et al., 2016).

Graph embedding scheme. Using the recursive procedures described above, we propose a WL-based graph embedding scheme that generates node embeddings from the node labels or attributes of the graphs. In the following, we use $m$ to denote the dimensionality of the node attributes (thus, $m = 1$ for the categorical labels).
Definition 4 (WL Features).
Let $G = (V, E)$ and let $H$ be the number of WL iterations. Then, for every $h \in \{0, \dots, H\}$, we define the WL features as

$X^h = \big[x^h(v_1), \dots, x^h(v_{n_G})\big]^{\mathsf{T}},$   (6)

where $x^h(v) = \ell^h(v)$ for categorically labelled graphs and $x^h(v) = a^h(v)$ for continuously attributed graphs. We refer to $X^h \in \mathbb{R}^{n_G \times m}$ as the node features of graph $G$ at iteration $h$. Then, the node embeddings of graph $G$ at iteration $H$ are defined as

$f^H: G \to \mathbb{R}^{n_G \times (H+1)m}, \quad f^H(G) := \big[X^0, X^1, \dots, X^H\big].$   (7)
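For the continuous case, the propagation of Equation (5) and the concatenation of Equations (6)–(7) can be sketched as follows (our illustration, assuming an unweighted graph given as a dense adjacency matrix, i.e. $w \equiv 1$, so the row-normalisation equals division by the degree):

```python
import numpy as np

def wl_embed_continuous(A, X, H):
    """Node embeddings for attributed graphs: propagate attributes with
    x^{h+1}(v) = 0.5 * (x^h(v) + mean over neighbours) (Eq. 5), then
    concatenate the features of all iterations 0..H (Eqs. 6-7)."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)  # guard isolated nodes
    W = A / deg                                        # row-normalised adjacency
    feats = [X.astype(float)]
    for _ in range(H):
        feats.append(0.5 * (feats[-1] + W @ feats[-1]))
    return np.hstack(feats)  # shape (n, (H+1) * m)
```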
We observe that a graph can be both categorically labelled and continuously attributed, and one could extend the above scheme by jointly considering this information, for instance by concatenating the node features. However, we leave this scenario as an extension for future work, thereby avoiding the need to define an appropriate distance measure between categorical and continuous data, which is a long-standing issue (Stevens, 1946).
3.2 Computing the Wasserstein distance
Once the node embeddings are generated by the graph embedding scheme, we evaluate the pairwise Wasserstein distance between graphs. We start by computing the ground distances between each pair of nodes. For categorical node features, we use the normalised Hamming distance, i.e.:
$d_{\mathrm{Ham}}(x, x') = \frac{1}{H+1} \sum_{h=0}^{H} \rho\big(x^h, x'^h\big), \quad \rho(a, b) = \begin{cases} 1 & \text{if } a \neq b \\ 0 & \text{otherwise.} \end{cases}$   (8)

The Hamming distance can be pictured as the normalised sum of the discrete metric $\rho$ on each of the features. It equals $1$ when two vectors have no features in common and $0$ when the vectors are identical. We use the Hamming distance because, in this case, the Weisfeiler–Lehman features are indeed categorical and their values carry no meaning. For continuous node features, on the other hand, we employ the Euclidean distance
$d_E(x, x') = \lVert x - x' \rVert_2.$   (9)
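Both ground distances operate on the concatenated embeddings and can be computed in a vectorised fashion for all node pairs at once; a sketch (ours, with assumed function names):

```python
import numpy as np

def hamming_ground_distance(Z1, Z2):
    """Normalised Hamming distance (Eq. 8): fraction of WL iterations whose
    compressed labels differ, for every pair of nodes of two graphs."""
    return (Z1[:, None, :] != Z2[None, :, :]).mean(axis=2)

def euclidean_ground_distance(Z1, Z2):
    """Euclidean ground distance (Eq. 9) between all pairs of node embeddings."""
    diff = Z1[:, None, :] - Z2[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))
```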
We then plug the ground distance into the equation of Definition 1 and compute the Wasserstein distance using a network simplex method (Peyré et al., 2019).
Computational complexity. Naively, the computation of the Wasserstein distance has a complexity of $\mathcal{O}(n^3 \log n)$, with $n$ being the cardinality of the indexed set of node embeddings, i.e. the number of nodes in the two graphs. Nevertheless, efficient speed-up tricks can be employed. In particular, approximations relying on Sinkhorn regularisation have been proposed (Cuturi, 2013), some of which reduce the computational burden to near-linear time while preserving accuracy (Altschuler et al., 2017). Such speed-up strategies become particularly useful for larger data sets, i.e. graphs with thousands of nodes, and can easily be integrated into our method. See Appendix A.7 for a practical discussion.
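As an illustration of the Sinkhorn speed-up mentioned above (a textbook sketch of entropy-regularised optimal transport following Cuturi (2013), not the exact routine used in our experiments), the regularised plan is obtained by alternately rescaling the rows and columns of the Gibbs kernel $K = e^{-M/\varepsilon}$:

```python
import numpy as np

def sinkhorn_cost(M, a, b, eps=0.05, n_iter=500):
    """Entropy-regularised approximation of <P, M>: alternately scale
    K = exp(-M / eps) until the marginals a and b are (approximately) matched."""
    K = np.exp(-M / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)  # match column marginals
        u = a / (K @ v)    # match row marginals
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return (P * M).sum()
```

Smaller values of `eps` approximate the exact distance more closely, at the cost of slower convergence and potential numerical underflow.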
4 From Wasserstein distance to kernels
From the graph Wasserstein distance, one can construct a similarity measure to be used in a learning algorithm. In this section, we propose a new graph kernel, state some claims about its (in)definiteness, and elaborate on how to use it for classifying graphs with continuous and categorical node labels.
Definition 5 (Wasserstein Weisfeiler–Lehman).
Given a set of graphs $\{G_1, \dots, G_N\}$ and the GWD defined for each pair of graphs on their WL embeddings, we define the Wasserstein Weisfeiler–Lehman (WWL) kernel as

$K_{\mathrm{WWL}}(G, G') = e^{-\lambda D_W^{f^H}(G, G')}.$   (10)
This is an instance of a Laplacian kernel, which was shown to offer favourable conditions for positive definiteness in the case of non-Euclidean distances (Feragen et al., 2015). Obtaining the WWL kernel concludes the procedure described in Algorithm 1. In the remainder of this section, we distinguish between the categorical WWL kernel, obtained on graphs with categorical labels, and the continuous WWL kernel, obtained on continuously attributed graphs, via the graph embedding schemes described in Section 3.1.
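Equation (10) turns a matrix of pairwise graph Wasserstein distances into a kernel matrix in a single step; a minimal sketch (ours, operating on a hypothetical precomputed distance matrix):

```python
import numpy as np

def wwl_kernel_matrix(D, lam=1.0):
    """Laplacian kernel on pairwise GWD values (Eq. 10):
    K_ij = exp(-lambda * D_ij)."""
    return np.exp(-lam * np.asarray(D, dtype=float))
```

The diagonal is always 1, since each graph has zero Wasserstein distance to itself, and larger $\lambda$ sharpens the locality of the kernel.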
For Euclidean spaces, obtaining positive definite kernels from distance functions is a well-studied topic (Haasdonk and Bahlmann, 2004). However, the Wasserstein distance in its general form is not isometric to an $L^2$ norm, i.e. there is no metric-preserving mapping, as the metric space it induces strongly depends on the chosen ground distance (Figalli and Villani, 2011). Therefore, despite being a metric, it is not necessarily possible to derive a positive definite kernel from the Wasserstein distance in its general formulation, because the classical approaches (Haasdonk and Bahlmann, 2004) cannot be applied here. Nevertheless, as a consequence of using the Laplacian kernel (Feragen et al., 2015), we can show that, in the setting of categorical node labels, the obtained kernel is positive definite.
Theorem 1.
The categorical WWL kernel is positive definite for all $\lambda > 0$.
For a proof, see Sections A.1 and A.1.1 in the Appendix. By contrast, for the continuous case, establishing the definiteness of the obtained kernel remains an open problem. We refer the reader to Section A.1.2 in the supplementary materials for further discussions and conjectures.
To ensure the theoretical and practical correctness of our results, we employ recently developed methods for learning with indefinite kernels. More precisely, we utilise learning methods for Kreĭn spaces, which have been specifically designed to work with indefinite kernels (Ong et al., 2004); in general, kernels that are not positive definite induce reproducing kernel Kreĭn spaces (RKKS). These spaces can be seen as a generalisation of reproducing kernel Hilbert spaces, with which they share similar mathematical properties, making them amenable to machine learning techniques. Recent algorithms (Loosli et al., 2015; Oglic and Gärtner, 2018) are capable of solving learning problems in RKKS, and their results indicate that there are clear benefits (in terms of classification performance, for example) of learning in such spaces. Therefore, when evaluating WWL, we use a Kreĭn SVM (KSVM; Loosli et al., 2015) as a classifier for the case of continuous attributes.
Table 1: Classification accuracies on graphs with categorical node labels (data sets: MUTAG, PTC-MR, NCI1, PROTEINS, D&D, ENZYMES; methods: V, E, WL, WL-OA, WWL).
Table 2: Classification accuracies on graphs with continuous node attributes (data sets: ENZYMES, PROTEINS, IMDB-B, BZR, COX2, BZR-MD, COX2-MD; methods: VH-C, RBF-WL, HGK-WL, HGK-SP, GH, WWL).
5 Experimental evaluation
In this section, we analyse how the performance of WWL compares with state-of-the-art graph kernels. In particular, we empirically observe that WWL (1) is competitive with the best graph kernels for categorically labelled data, and (2) outperforms all the state-of-the-art graph kernels for attributed graphs.
5.1 Data sets
We report results on real-world data sets from multiple sources (Borgwardt et al., 2005; Vishwanathan et al., 2010; Shervashidze et al., 2011) and use either their continuous attributes or categorical labels for evaluation. In particular, MUTAG, PTC-MR, NCI1 and D&D are equipped with categorical node labels only; ENZYMES and PROTEINS have both categorical labels and continuous attributes; IMDB-B, BZR and COX2 only contain continuous attributes; finally, BZR-MD and COX2-MD have both continuous node attributes and edge weights. Further information on the data sets is available in Supplementary Table A.1. Additionally, we report results on synthetic data (Synthie and SYNTHETICnew) in Appendix A.5. All data sets were downloaded from Kersting et al. (2016).
5.2 Experimental setup
We compare WWL with state-of-the-art graph kernel methods from the literature, as well as with relevant baselines, which we trained ourselves on the same splits (see below). In particular, for the categorical case we compare with WL (Shervashidze and Borgwardt, 2009) and WL-OA (Kriege et al., 2016), as well as with the vertex (V) and edge (E) histograms. For the continuously attributed data sets, we compare with two instances of the Hash Graph Kernel (HGK-SP; HGK-WL) (Morris et al., 2016) and with the GraphHopper kernel (GH) (Feragen et al., 2013). As a further comparison partner, we use a continuous vertex histogram (VH-C), which is defined as an RBF kernel between the sums of the graph node embeddings. Furthermore, to highlight the benefits of using the Wasserstein distance in our method, we replace it with an RBF kernel. More precisely, given two graphs $G$ and $G'$, we first compute the Gaussian kernel between each pair of the node embeddings obtained in the same fashion as for WWL, thereby obtaining a kernel matrix between node embeddings; we then sum up all the entries of this matrix and use the result as the kernel value between the two graphs. This procedure is repeated for each pair of graphs to obtain the final graph kernel matrix. We refer to this baseline as RBF-WL.
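The RBF-WL baseline described above can be sketched as follows (our illustration; the function name and the bandwidth parameter `gamma` are assumptions):

```python
import numpy as np

def rbf_wl(Z1, Z2, gamma=1.0):
    """RBF-WL baseline: Gaussian kernel between every pair of node embeddings
    of two graphs, summed into a single graph-level kernel value."""
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq).sum()
```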
As a classifier, we use an SVM (or a KSVM in the case of WWL) and 10-fold cross-validation, selecting the parameters on the training set only. We repeat each cross-validation split 10 times and report the average accuracy. We employ the same splits for each evaluated method, thus guaranteeing a fully comparable setup among all evaluated methods. Please refer to Appendix A.6 for details on the hyperparameter selection.
Implementation
Available Python implementations can be used to compute the WL kernel (Sugiyama et al., 2018) and the Wasserstein distance (Flamary and Courty, 2017). We leverage these resources to develop our method and make our code publicly available. We use the original implementations provided by the respective authors to compute the WL-OA, HGK, and GH methods.
5.3 Results and discussion
The results are evaluated via classification accuracy and summarised in Table 1 and Table 2 for the categorical labels and continuous attributes, respectively. The best-performing methods, up to the resolution implied by the standard deviation across repetitions, are highlighted in boldface. Additionally, to evaluate significance, we perform two-sample t-tests with a 0.05 significance threshold and Bonferroni correction for multiple hypothesis testing within each data set; significantly outperforming methods are denoted by an asterisk.

5.3.1 Categorical labels
On the categorical data sets, WWL is comparable to the WL-OA kernel, while improving over the classical WL. In particular, WWL largely improves over WL-OA on PTC-MR and is slightly better on D&D, while WL-OA is better on NCI1 and PROTEINS.
Unsurprisingly, our approach is comparable to WL-OA, whose main idea is to solve the optimal assignment problem by defining Dirac kernels on histograms of node labels, using multiple iterations of WL. This formulation is similar to the one we provide for categorical data, but it relies on optimal assignment rather than optimal transport, and therefore requires one-to-one mappings instead of continuous transport maps. Besides, we solve the optimal transport problem on the concatenated embeddings, thereby jointly exploiting representations at multiple WL iterations; in contrast, WL-OA performs an optimal assignment at each iteration of WL and only combines them in a second stage. However, the key advantage of WWL over WL-OA is its capacity to account for continuous attributes.
5.3.2 Continuous attributes
In this setting, WWL significantly outperforms the other methods on the majority of the data sets and is on a par with them on the remaining ones. We further compute the average rank of each method in the continuous setting, with WWL scoring first; ordered by average rank (calculated from Table 2), the methods are WWL, HGK-WL, RBF-WL, HGK-SP, and VH-C. This is a remarkable improvement over the current state of the art, and it indeed establishes a new one. We thus raise the bar in kernel-based classification of attributed graphs. As mentioned in Section 4, the kernel obtained from continuous attributes is not necessarily positive definite. However, we empirically observe the kernel matrices to be positive definite (up to a numerical error), further supporting our theoretical considerations (see Appendix A.1). In practice, the difference between the results obtained with classical SVMs in an RKHS and those obtained with the KSVM approach is negligible.
Comparison with Hash Graph Kernels.
The Hash Graph Kernel (HGK) approach is somewhat related to our propagation scheme. By using multiple hashing functions, the HGK method is capable of extending certain existing graph kernels to the continuous setting. This helps avoid the limitations of perfect hashing, which cannot express small differences in continuous attributes. A drawback of the random hashing performed by HGK is that it requires additional parameters and introduces a stochastic element into the kernel matrix computation. By contrast, our propagation scheme is fully continuous and uses the Wasserstein distance to capture small differences in the distributions of continuous node attributes. Moreover, the observed performance gap suggests that an entirely continuous representation of the graphs provides clear benefits over hashing.
6 Conclusion
In this paper, we present a new family of graph kernels, the Wasserstein Weisfeiler–Lehman (WWL) graph kernels. Our experiments show that WWL graph kernels constitute the new state of the art for graph classification in the scenario of continuous node attributes, while matching the state of the art in the categorical setting. As a line of research for future work, we see great potential in runtime improvements, which would allow applications of our method to data sets with larger graphs. In fact, preliminary experiments (see Section A.7 as well as Figure A.1 in the Appendix) already confirm the benefit of Sinkhorn regularisation when the average number of nodes in the graphs increases. In parallel, it would be beneficial to derive approximations of the explicit feature representations in the RKKS, as this would also provide a consistent speed-up. We further envision that major theoretical contributions could be made by defining theoretical bounds to ensure the positive definiteness of the WWL kernel in the case of continuous node attributes. Finally, optimisation objectives based on optimal transport could be employed to develop new algorithms based on graph neural networks (Duvenaud et al., 2015; Kipf and Welling, 2017). On a more general level, our proposed method provides a solid foundation and highlights the large potential of optimal transport theory for machine learning.
References
 Altschuler et al. [2017] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1964–1974, 2017.
 Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Balcan et al. [2008] M.-F. Balcan, A. Blum, and N. Srebro. A theory of learning with similarity functions. Machine Learning, 72(1–2):89–112, 2008.
 Berg et al. [1984] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. Springer, 1984.
 Borgwardt and Kriegel [2005] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Fifth IEEE International Conference on Data Mining, page 8. IEEE, 2005.
 Borgwardt et al. [2005] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21:i47–i56, 2005.
 Bridson and Häfliger [2013] M. R. Bridson and A. Häfliger. Metric spaces of nonpositive curvature. Springer, Heidelberg, Germany, 2013.
 Cuturi [2013] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
 Duvenaud et al. [2015] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
 Feragen et al. [2013] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. Borgwardt. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems, pages 216–224, 2013.

 Feragen et al. [2015] A. Feragen, F. Lauze, and S. Hauberg. Geodesic exponential kernels: When curvature and linearity conflict. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3032–3042, 2015.
 Figalli and Villani [2011] A. Figalli and C. Villani. Optimal transport and curvature. In Nonlinear PDE’s and Applications, pages 171–217. Springer, 2011.
 Flamary and Courty [2017] R. Flamary and N. Courty. POT: Python Optimal Transport library, 2017. URL https://github.com/rflamary/POT.
 Frogner et al. [2015] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
 Fröhlich et al. [2005] H. Fröhlich, J. K. Wegner, F. Sieker, and A. Zell. Optimal assignment kernels for attributed molecular graphs. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 225–232, New York, NY, USA, 2005. ACM.
 Gardner et al. [2017] A. Gardner, C. A. Duncan, J. Kanno, and R. R. Selmic. On the definiteness of earth mover’s distance and its relation to set intersection. IEEE Transactions on Cybernetics, 2017.
 Haasdonk and Bahlmann [2004] B. Haasdonk and C. Bahlmann. Learning with distance substitution kernels. In Joint Pattern Recognition Symposium, pages 220–227. Springer, 2004.
 Haussler [1999] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California, 1999.
 Kersting et al. [2016] K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
 Kipf and Welling [2017] T. N. Kipf and M. Welling. Semisupervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 Klicpera et al. [2019] J. Klicpera, A. Bojchevski, and S. Günnemann. Combining neural networks with personalized pagerank for classification on graphs. In International Conference on Learning Representations (ICLR), 2019.
 Kolouri et al. [2016] S. Kolouri, Y. Zou, and G. K. Rohde. Sliced Wasserstein kernels for probability distributions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5258–5267, 2016.
 Kriege and Mutzel [2012] N. Kriege and P. Mutzel. Subgraph matching kernels for attributed graphs. arXiv preprint arXiv:1206.6483, 2012.
 Kriege et al. [2016] N. M. Kriege, P.L. Giscard, and R. C. Wilson. On valid optimal assignment kernels and applications to graph classification. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1623–1631, 2016.
 Loosli et al. [2015] G. Loosli, S. Canu, and C. S. Ong. Learning SVM in Kreĭn spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1204–1216, 2015.
 Morris et al. [2016] C. Morris, N. M. Kriege, K. Kersting, and P. Mutzel. Faster kernels for graphs with continuous attributes via hashing. In IEEE 16th International Conference on Data Mining (ICDM), pages 1095–1100, 2016.
 Neumann et al. [2016] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.
 Oglic and Gärtner [2018] D. Oglic and T. Gärtner. Learning in reproducing kernel Kreĭn spaces. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, 2018.
 Ong et al. [2004] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.
 Peyré et al. [2019] G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5–6):355–607, 2019.
 Rabin et al. [2011] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.

 Rubner et al. [2000] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
 Schölkopf [2001] B. Schölkopf. The kernel trick for distances. In Advances in Neural Information Processing Systems, pages 301–307, 2001.
 Schölkopf and Smola [2002] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
 Shervashidze and Borgwardt [2009] N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. In Advances in Neural Information Processing Systems, 2009.
 Shervashidze et al. [2011] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
 Shin-Ichi [2012] O. Shin-Ichi. Barycenters in Alexandrov spaces of curvature bounded below. In Advances in Geometry, volume 14. De Gruyter, 2012.
 Stevens [1946] S. S. Stevens. On the theory of scales of measurement. Science, 1946.
 Sugiyama et al. [2018] M. Sugiyama, M. E. Ghisu, F. LlinaresLópez, and K. Borgwardt. graphkernels: R and python packages for graph comparison. Bioinformatics, 34(3):530–532, 2018.
 Turner et al. [2014] K. Turner, Y. Mileyko, S. Mukherjee, and J. Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52:44–70, Jul 2014.
 Vert [2008] J.P. Vert. The optimal assignment kernel is not positive definite. arXiv preprint arXiv:0801.4061, 2008.
 Villani [2008] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 Vishwanathan et al. [2010] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.
 Xu et al. [2019] H. Xu, D. Luo, H. Zha, and L. Carin. Gromov-Wasserstein learning for graph matching and node embedding. arXiv preprint arXiv:1901.06003, 2019.
 Yanardag and Vishwanathan [2015] P. Yanardag and S. Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.
Appendix A Appendix
A.1 Extended considerations on WWL definiteness
We now discuss the positive definiteness of our WWL kernel.
In general, whether distances obtained from optimal transport problems can be used to create positive definite kernels remains an open research question. Several attempts to draw general conclusions on the definiteness of the Wasserstein distance were unsuccessful, but insightful results on particular cases were obtained along the way. Here, we first collect some of these contributions and use them to prove that our WWL kernel for categorical embeddings is positive definite. We then elaborate further on the continuous embeddings case, for which we provide conjectures on practical conditions to obtain a pd kernel.
Before proceeding, let us recall some useful notions.
Definition 6 (Schölkopf and Smola [2002]).
A symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive definite (pd) kernel if it satisfies the condition

(11)  $\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, k(x_i, x_j) \geq 0$

for every $n \in \mathbb{N}$, $x_1, \ldots, x_n \in \mathcal{X}$, and $c_1, \ldots, c_n \in \mathbb{R}$.
The matrix $K$ of kernel values with entries $K_{ij} = k(x_i, x_j)$ is called the Gram matrix of $k$ with respect to $x_1, \ldots, x_n$. A conditional positive definite (cpd) kernel is a function that satisfies Equation 11 for all $c \in \mathbb{R}^n$ with $\sum_{i=1}^{n} c_i = 0$, whereas a conditional negative definite (cnd) kernel is a function that satisfies $\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, k(x_i, x_j) \leq 0$ for all $c \in \mathbb{R}^n$ with $\sum_{i=1}^{n} c_i = 0$.
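On a finite sample, these conditions can be checked numerically: Equation 11 holds for a symmetric Gram matrix precisely when its eigenvalues are non-negative, and the conditional variants can be tested after projecting onto the subspace of coefficient vectors that sum to zero. A minimal numpy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def is_pd_kernel_matrix(K, tol=1e-10):
    """Check Equation 11 on a finite sample: all eigenvalues of the
    symmetrised Gram matrix must be non-negative."""
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol))

def is_cnd_matrix(D, tol=1e-10):
    """Check conditional negative definiteness: c^T D c <= 0 for all c
    with sum(c) = 0, i.e. P D P is negative semi-definite, where P
    projects onto the zero-sum subspace (centering projector)."""
    n = D.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    M = P @ ((D + D.T) / 2) @ P
    return bool(np.all(np.linalg.eigvalsh(M) <= tol))

# Toy example: a squared Euclidean distance matrix is cnd, hence the
# Gaussian kernel exp(-lam * D) built from it is pd.
x = np.linspace(0, 1, 5).reshape(-1, 1)
D = (x - x.T) ** 2
assert is_cnd_matrix(D)
assert is_pd_kernel_matrix(np.exp(-1.0 * D))
```

The eigenvalue test on the centred matrix is equivalent to checking the quadratic form on all zero-sum vectors, which is why no explicit search over coefficients is needed.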
For Euclidean spaces, obtaining kernels from distance functions is a well-studied topic.
Proposition 1 (Haasdonk and Bahlmann [2004]).
Let $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric, non-negative distance function with $d(x, x) = 0$ for all $x \in \mathcal{X}$. If $d$ is isometric to an $L^2$ norm, then

(12)  $k(x, x') = -d(x, x')^2$

is a valid cpd kernel.
However, the Wasserstein distance in its general form is not isometric to an $L^2$ norm, as the metric space it induces strongly depends on the chosen ground distance [Figalli and Villani, 2011]. More recently, Feragen et al. [2015] argued that many types of data, including probability distributions, do not always reside in Euclidean spaces. Therefore, they define the family of geodesic exponential kernels relying on a non-Euclidean distance $d$, as follows:

(13)  $k(x, x') = e^{-\lambda \, d(x, x')^{q}}$

and, based on earlier considerations from Berg et al. [1984], show that, under certain conditions, the Laplacian kernel ($q = 1$ in Equation 13) is positive definite.
Proposition 2 (Feragen et al. [2015]).
The geodesic Laplacian kernel is positive definite for all $\lambda > 0$ if and only if the geodesic distance $d$ is conditional negative definite.
Once again, considerations on the negative definiteness of Wasserstein distance functions cannot be made at this level of generality. Certain ground distances, however, guarantee the negative definiteness of the resulting Wasserstein distance. In particular, the Wasserstein distance with the discrete metric $\rho(x, y) = \mathbb{1}[x \neq y]$ (i.e. $\rho$ in Equation 8) as the ground distance was proved to be conditional negative definite [Gardner et al., 2017].
We will now leverage these results to prove that the Wasserstein distance equipped with the Hamming ground distance is conditional negative definite, and therefore yields positive definite kernels for the categorical WL embeddings.
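The result of Gardner et al. can be illustrated numerically on toy data (our example, not from the paper): for equal-size label multisets with uniform weights, the Wasserstein distance under the discrete ground metric reduces to an assignment problem, and the resulting pairwise distance matrix is conditionally negative definite.

```python
import itertools
import numpy as np

def wasserstein_discrete(labels_a, labels_b):
    """Wasserstein distance between two equal-size label multisets under
    the discrete ground metric rho(x, y) = 1[x != y].  With uniform
    weights an optimal plan is a permutation, brute-forced here."""
    n = len(labels_a)
    cost = lambda perm: sum(labels_a[i] != labels_b[perm[i]] for i in range(n))
    return min(cost(p) for p in itertools.permutations(range(n))) / n

# Pairwise distance matrix over a few toy label multisets ...
sets = [(0, 0, 1, 2), (0, 1, 1, 2), (2, 2, 2, 2), (0, 1, 2, 3)]
m = len(sets)
D = np.array([[wasserstein_discrete(s, t) for t in sets] for s in sets])

# ... is conditionally negative definite: c^T D c <= 0 whenever sum(c) = 0,
# equivalently P D P is negative semi-definite for the centering projector P.
P = np.eye(m) - np.ones((m, m)) / m
assert np.all(np.linalg.eigvalsh(P @ D @ P) <= 1e-10)
```

Brute force over permutations is only viable for tiny sets; it stands in for a proper solver purely to keep the illustration self-contained.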
A.1.1 The case of categorical embeddings
When generating node embeddings with the Weisfeiler–Lehman labelling scheme using a shared dictionary across all graphs, the solutions to the optimal transport problem are also shared across iterations. We denote the Weisfeiler–Lehman embedding scheme of Definition 4 as $f^H$, and let $D_W^H$ be the corresponding GWD on a set of graphs with categorical labels. Let $d_{\mathrm{Ham}}$ of Equation 8 be the ground distance of $D_W^H$. Then, the following useful results hold.
Lemma 1.
If a transportation plan with transport matrix $P^*$ is optimal in the sense of Definition 1 for distances between embeddings obtained with $f^H$, then it is also optimal for the discrete distances between the $H$-th iteration values obtained with the Weisfeiler–Lehman procedure.
Proof. See Appendix A.2.
Lemma 2.
If a transportation plan with transport matrix $P^*$ is optimal in the sense of Definition 1 for distances between embeddings obtained with $f^{H+1}$, then it is also optimal for distances between embeddings obtained with $f^H$.
Proof. See Appendix A.3.
We can now show that the Wasserstein distance between categorical WL node embeddings is a conditional negative definite function.
Theorem 2.
The GWD $D_W^H$ equipped with the ground distance $d_{\mathrm{Ham}}$ is a conditional negative definite function.
Proof. See Appendix A.4.
Proof of Theorem 1. Theorem 2, in light of Proposition 2, implies that the WWL kernel of Definition 5 is positive definite for all $\lambda > 0$.
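The statement of Theorem 1 can be sanity-checked numerically on any cnd distance matrix. In the sketch below (our illustration, not the paper's experiment), the plain Euclidean distance between a few random points stands in for a GWD matrix, since Euclidean distances are known to be cnd:

```python
import numpy as np

# Stand-in for a GWD matrix: any cnd distance works here; we use the
# Euclidean distance between a few random "graph embeddings".
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

# Per Proposition 2 / Theorem 1, the Laplacian kernel exp(-lam * D) is
# positive definite for every lam > 0 when D is cnd; we verify this via
# the eigenvalues of the Gram matrix.
for lam in (0.1, 1.0, 10.0):
    K = np.exp(-lam * D)
    assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```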
We will now consider the definiteness of kernels in the continuous setting.
A.1.2 The case of continuous embeddings
While we proved the positive definiteness of our kernel in the categorical case, the continuous case is considerably harder to tackle. We conjecture that, under certain conditions, the same might hold for continuous features. While we do not have a formal proof yet, in what follows we discuss arguments to support this conjecture, which seems to agree with our empirical findings.²
²We observe that, for all considered data sets and after standardisation of the input features prior to the embedding scheme, the GWD matrices are conditional negative definite.
The curvature of the metric space induced by the Wasserstein metric for a given ground distance plays an important role here. We first need to define Alexandrov spaces.
Definition 7 (Alexandrov space).
Given a metric space $(M, d)$ and a real number $k$, the space is called an Alexandrov space if its sectional curvature is $\geq k$.
Roughly speaking, the curvature indicates to what extent a geodesic triangle will be deformed in the space. The case $k = 0$ is special, as no distortion happens here; hence, spaces that satisfy this property are called flat. The concept of Alexandrov spaces is required in the following proposition, taken from a theorem by Feragen et al. [2015], which shows the relationship between a kernel and its underlying metric space.
Proposition 3.
The geodesic Gaussian kernel (i.e. $q = 2$ in Equation 13) is positive definite for all $\lambda > 0$ if and only if the underlying metric space $(M, d)$ is flat in the sense of Alexandrov, i.e. if any geodesic triangle in $M$ can be isometrically embedded in a Euclidean space.
However, it is unlikely that the space induced by the Wasserstein distance is locally flat, as not even the geodesics (i.e. the generalisation of shortest paths to arbitrary metric spaces) between graph embeddings are necessarily unique, as we shall show below. Hence, we use the geodesic Laplacian kernel instead of the Gaussian one, because it poses less strict requirements on the induced space, as stated in Proposition 2. More precisely, the metric used in the kernel function needs to be cnd. We cannot directly prove this yet; what we can prove is that the metric space induced by the GWD, which we shall here refer to as $\mathcal{Z}$, does not have a curvature that is bounded from above.
Definition 8.
A metric space is said to be $\mathrm{CAT}(\kappa)$ if its curvature is bounded from above by some real number $\kappa$. This can also be seen as a “relaxed” definition, or generalisation, of a Riemannian manifold.
Theorem 3.
$\mathcal{Z}$ is not in $\mathrm{CAT}(\kappa)$ for any $\kappa \in \mathbb{R}$, meaning that its curvature is not bounded from above by any $\kappa$.
Proof.
This follows from an argument similar to the one presented by Turner et al. [2014], which we briefly sketch. Let $X$ and $Y$ be two graphs, and assume that $\mathcal{Z}$ is a $\mathrm{CAT}(\kappa)$ space for some $\kappa \in \mathbb{R}$. Then it follows [Bridson and Häfliger, 2013, Proposition 2.11, p. 23] that if two points are sufficiently close, there is a unique geodesic between them. However, we can construct a family of graph embeddings for which this is not the case. To this end, let $\varepsilon > 0$ and let $X^{\varepsilon} = \{x_1, x_2\}$ and $Y^{\varepsilon} = \{y_1, y_2\}$ be two graph embeddings with node embeddings $x_1 = (\varepsilon, 0)$ and $x_2 = (-\varepsilon, 0)$, as well as $y_1 = (0, \varepsilon)$ and $y_2 = (0, -\varepsilon)$, respectively. Since we use the Euclidean distance as a ground distance, there will be two optimal transport plans: the first maps $x_1$ to $y_1$ and $x_2$ to $y_2$, while the second maps $x_1$ to $y_2$ and $x_2$ to $y_1$. Hence, we have found two geodesics that connect $X^{\varepsilon}$ and $Y^{\varepsilon}$. Since we may choose $\varepsilon$ to be arbitrarily small, the space cannot be $\mathrm{CAT}(\kappa)$ for any $\kappa \in \mathbb{R}$. ∎
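The non-uniqueness of the optimal transport plan in this construction can be verified numerically; the symmetric coordinates below are our reconstruction of the example, chosen so that both matchings have the same cost:

```python
import numpy as np

eps = 1e-3
X = np.array([[eps, 0.0], [-eps, 0.0]])   # node embeddings of graph X
Y = np.array([[0.0, eps], [0.0, -eps]])   # node embeddings of graph Y

# Euclidean ground-distance matrix between the two embeddings.
C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

# The two candidate transport plans (permutations with uniform mass 1/2).
P1 = np.array([[0.5, 0.0], [0.0, 0.5]])   # x1 -> y1, x2 -> y2
P2 = np.array([[0.0, 0.5], [0.5, 0.0]])   # x1 -> y2, x2 -> y1

# Both plans have identical total cost eps * sqrt(2): both are optimal,
# so they induce two distinct geodesics between X and Y.
c1, c2 = (P1 * C).sum(), (P2 * C).sum()
assert np.isclose(c1, c2) and np.isclose(c1, eps * np.sqrt(2))
```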
While this means that we do not have a simple upper bound on the curvature, we have the following conjecture.
Conjecture 1.
$\mathcal{Z}$ is an Alexandrov space with curvature bounded from below by zero.
For a proof idea, we refer to Turner et al. [2014]; the main argument involves characterising the distance between triples of graph embeddings. This conjecture is helpful insofar as being a non-negatively-curved Alexandrov space is a necessary prerequisite for $\mathcal{Z}$ to be a Hilbert space [Shin-Ichi, 2012]. In turn, Feragen et al. [2015] show that cnd metrics and Hilbert spaces are intricately linked. We thus have some hope of obtaining a cnd metric, even though we lack a proof as of now. Our empirical results, however, indicate that it is possible to turn the GWD into a cnd metric, provided that a proper normalisation takes place. Intuitively, for high-dimensional input spaces, standardisation of the input features changes the curvature of the induced space by making it locally (almost) flat.
To support this argumentation, we refer to an existing way to ensure positive definiteness: one can use an alternative to the classical Wasserstein distance known as the sliced Wasserstein distance [Rabin et al., 2011]. The idea is to project high-dimensional distributions onto one-dimensional spaces and to compute the Wasserstein distance as a combination of the resulting one-dimensional representations. Kolouri et al. [2016] showed that each of the one-dimensional Wasserstein distances is conditional negative definite. The kernel on high-dimensional representations is then defined as a combination of the one-dimensional positive definite counterparts.
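The sliced construction can be sketched in a few lines (our minimal Monte Carlo version for equal-size point clouds, exploiting the closed form of the 1-D Wasserstein distance via sorting; not the implementation of Kolouri et al.):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=50, seed=0):
    """Monte Carlo sliced Wasserstein distance between two equal-size
    point clouds: project onto random unit directions and average the
    closed-form 1-D Wasserstein distances (sorted differences)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)           # random unit direction
        x_proj = np.sort(X @ theta)              # 1-D projections, sorted
        y_proj = np.sort(Y @ theta)
        total += np.abs(x_proj - y_proj).mean()  # 1-D W1 distance
    return total / n_projections

X = np.random.default_rng(1).normal(size=(20, 5))
assert sliced_wasserstein(X, X) == 0.0       # identity of indiscernibles
assert sliced_wasserstein(X, X + 1.0) > 0.0  # a shifted cloud is farther away
```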
A.2 Proof of Lemma 1
Proof.
We recall the matrix notation introduced in Equation 2 of the main paper, where $M$ is the cost (or distance) matrix, $P$ is a transport matrix (or joint probability), and $\langle \cdot, \cdot \rangle$ is the Frobenius dot product. Since we give equal weight (i.e. equal probability mass) to each of the vectors in each set, the set of admissible transport matrices $\Gamma$ contains all non-negative matrices $P$ with $\sum_j P_{ij} = 1/n$ and $\sum_i P_{ij} = 1/n'$.
For notational simplicity, let us denote by $D^H$ the Hamming matrix, whose $(i,j)$-th entry is given by the Hamming distance between the embedding of the $i$-th node of graph $G$ and the embedding of the $j$-th node of graph $G'$. Similarly, we define $B^h$ to be the discrete metric distance matrix, whose $(i,j)$-th entry is given by the discrete distance between feature $h$ of node embedding $i$ of graph $G$ and feature $h$ of node embedding $j$ of graph $G'$. It is easy to see that $D^H_{ij} \in \{0, \tfrac{1}{H+1}, \ldots, 1\}$ while $B^h_{ij} \in \{0, 1\}$ and, by definition,
$$D^H = \frac{1}{H+1} \sum_{h=0}^{H} B^h.$$
Moreover, due to the WL procedure, two labels that are different at iteration $h$ will also be different at iteration $h+1$. Hence, the following identity holds:
$$B^h_{ij} \leq B^{h+1}_{ij} \quad \text{for all } h, i, j,$$
which implies that $B^H_{ij} = 0 \Leftrightarrow D^H_{ij} = 0$. An optimal transportation plan $P^*$ for $f^H$ embeddings satisfies
$$\langle P^*, D^H \rangle \leq \langle P, D^H \rangle \quad \text{for all } P \in \Gamma.$$
Assuming that $P^*$ is not optimal for $B^H$, we can define $\hat{P} \in \Gamma$ such that
$$\langle \hat{P}, B^H \rangle < \langle P^*, B^H \rangle.$$
Since the entries of $B^H$ are either $0$ or $1$, we can define the set of index tuples $S = \{(i,j) : B^H_{ij} = 0\}$ and rewrite the inequality as
$$\sum_{(i,j) \notin S} \hat{P}_{ij} < \sum_{(i,j) \notin S} P^*_{ij}.$$
Due to the constraints on the entries of $\hat{P}$ and $P^*$, namely $\sum_{i,j} \hat{P}_{ij} = \sum_{i,j} P^*_{ij} = 1$, this implies that, by rearranging the transport map, there is more mass that could be transported at no cost. In our formalism:
$$\sum_{(i,j) \in S} \hat{P}_{ij} > \sum_{(i,j) \in S} P^*_{ij}.$$
However, as stated before, entries of $B^H$ that are $0$ are also $0$ in $D^H$, and a better transport plan for $B^H$ would therefore also be better for $D^H$:
$$\langle \hat{P}, D^H \rangle < \langle P^*, D^H \rangle,$$
which contradicts the optimality assumption above. Hence, $P^*$ is also optimal for $B^H$.
∎
A.3 Proof of Lemma 2
Proof.
Intuitively, the transportation plan at iteration $H+1$ is a “refinement” of the transportation plan at iteration $H$, in that only a subset of the optimal transportation plans remains optimal for the new cost matrix $D^{H+1}$. Using the same notation as in the proof in Appendix A.2, and due to the WL procedure, two labels that are different at iteration $h$ will also be different at iteration $h+1$. Hence, the following identities hold:
$$D^{H+1} = \frac{1}{H+2} \sum_{h=0}^{H+1} B^h = \frac{1}{H+2} \left( (H+1)\, D^H + B^{H+1} \right).$$
An optimal transportation plan $P^*$ for $f^{H+1}$ embeddings satisfies
$$\langle P^*, D^{H+1} \rangle \leq \langle P, D^{H+1} \rangle \quad \text{for all } P \in \Gamma,$$
which can also be written as
$$\langle P^*, (H+1)\, D^H + B^{H+1} \rangle \leq \langle P, (H+1)\, D^H + B^{H+1} \rangle.$$
The values of $D^H_{ij}$ increase in a stepwise fashion for increasing $H$, and their ordering remains constant except for entries that were equal at iteration $H$ and become different at iteration $H+1$. Hence, since our metric distance matrices satisfy these monotonicity conditions and because $P^*$ is optimal for $B^{H+1}$ according to Lemma 1, it follows that
$$\langle P^*, D^H \rangle \leq \langle P, D^H \rangle \quad \text{for all } P \in \Gamma.$$
Hence, $P^*$ is also optimal for $f^H$ embeddings. ∎
A.4 Proof of Theorem 2
Proof.
Using the same notation as in the proof in Appendix A.2 and the formulation in Equation 2, we can write
$$D_W^H(G, G') = \min_{P \in \Gamma} \langle P, D^H \rangle = \min_{P \in \Gamma} \frac{1}{H+1} \sum_{h=0}^{H} \langle P, B^h \rangle.$$
Let $P^*$ be an optimal solution for iteration $H$; then, from Lemmas 1 and 2, it is also an optimal solution for $B^H$ as well as for all $B^h$ with $h \leq H$, and we can rewrite the equation as a sum of optimal transport problems:

(14)  $$D_W^H(G, G') = \frac{1}{H+1} \sum_{h=0}^{H} \min_{P \in \Gamma} \langle P, B^h \rangle.$$

This corresponds to a sum of one-dimensional optimal transport problems relying on the discrete metric, which were shown to be conditional negative definite functions [Gardner et al., 2017]. The final sum is therefore conditional negative definite as well. ∎
A.5 Data sets and additional results
We report additional information on the data sets used in our experimental comparison in Supplementary Table A.1.
Data set       | Class Ratio | Node Labels | Node Attributes | Edge Weights | Graphs | Classes
MUTAG          |             | ✓           |                 |              |        |
NCI1           |             | ✓           |                 |              |        |
PTC-MR         |             | ✓           |                 |              |        |
D&D            |             | ✓           |                 |              |        |
ENZYMES        | per class   | ✓           | ✓               |              |        |
PROTEINS       |             | ✓           | ✓               |              |        |
BZR            |             | ✓           | ✓               |              |        |
COX2           |             | ✓           | ✓               |              |        |
SYNTHIE        | per class   |             | ✓               |              |        |
IMDB-BINARY    |             |             | (✓)             |              |        |
SYNTHETIC-new  |             |             | ✓               |              |        |
BZR-MD         |             | ✓           |                 | ✓            |        |
COX2-MD        |             | ✓           |                 | ✓            |        |
Our data sets belong to multiple chemoinformatics domains, including small molecules (MUTAG, PTC-MR, NCI1), macromolecules (ENZYMES, PROTEINS, D&D) and chemical compounds (BZR, COX2). We further consider a movie collaboration data set (IMDB-BINARY, see [Yanardag and Vishwanathan, 2015] for a description) and two synthetic data sets, SYNTHIE and SYNTHETIC-new, created by Morris et al. [2016] and Feragen et al. [2013], respectively. The BZR-MD and COX2-MD data sets do not have node attributes but contain the atomic distance between each pair of connected atoms as an edge weight. We do not consider distances between non-connected nodes [Kriege and Mutzel, 2012], and we equip the nodes with one-hot-encoded categorical attributes representing the atom type, i.e. what is originally intended as a categorical node label. On IMDB-BINARY, the node degree was used as a (semi-)continuous feature for each node [Yanardag and Vishwanathan, 2015]. For all the other data sets, we use the off-the-shelf version provided by Kersting et al. [2016].

Results on the synthetic data sets are provided in Table A.2. We decided not to include them in the main manuscript because of the very unstable and unreliable performances we obtained. In particular, for both data sets there is very high variation among the different methods. Furthermore, we experimentally observed that even a slight modification of the node features (e.g. normalisation or scaling of the embedding scheme) resulted in large changes in performance. Additionally, it has been previously reported [Morris et al., 2016, Feragen et al., 2013] that on SYNTHETIC-new, WL with the degree treated as a categorical node label outperforms the competitors, suggesting that the continuous attributes are indeed not very informative. Therefore, we excluded these data sets from the main manuscript, as we concluded that they are unsuitable for a fair assessment of method quality.
Method  | SYNTHIE | SYNTHETIC-new
VHC     |         |
RBF-WL  |         |
HGK-WL  |         |
HGK-SP  |         |
GH      |         |
WWL     |         |
A.6 Details on hyperparameter selection
The following ranges are used for the hyperparameter selection: the SVM parameter $C$ (with separate grids for continuous and categorical attributes); the number of WL iterations $H$; and the parameter $\lambda$ of the WWL. For RBF-WL and VHC we use the default parameter for the Gaussian kernel, i.e. $\gamma = 1/d$, where $d$ is the dimensionality of the node attributes. For the GH kernel, we also fix the kernel parameter, and for HGK we fix the number of iterations for each data set (with a different value for SYNTHETIC-new); these setups were suggested by the respective authors [Morris et al., 2016, Feragen et al., 2013]. Furthermore, since HGK is a randomised method, we compute each kernel matrix multiple times and average the results. When the dimensionality of the continuous attributes is larger than one, these are normalised to ensure comparability among the different feature scales, in each data set except BZR and COX2, whose node attributes are location coordinates.
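The selection scheme can be sketched as a grid search on a precomputed GWD matrix. In the sketch below (our illustration), a 1-nearest-neighbour rule read off the Gram matrix stands in for the SVM used in the paper, and the grids are placeholders rather than the paper's exact ranges:

```python
import numpy as np

def wwl_kernel(D, lam):
    """Laplacian (geodesic exponential, q = 1) kernel from a GWD matrix."""
    return np.exp(-lam * D)

def cv_accuracy(K, y, n_folds=3):
    """k-fold CV accuracy of a 1-nearest-neighbour rule on the Gram
    matrix (a lightweight stand-in for the SVM used in the paper)."""
    n = len(y)
    folds = np.array_split(np.arange(n), n_folds)
    correct = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # the most similar training graph has the highest kernel value
        nearest = train_idx[np.argmax(K[np.ix_(test_idx, train_idx)], axis=1)]
        correct += int(np.sum(y[nearest] == y[test_idx]))
    return correct / n

def select_lambda(D, y, lambdas=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Grid search over the WWL parameter lambda (placeholder grid)."""
    scores = {lam: cv_accuracy(wwl_kernel(D, lam), y) for lam in lambdas}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy "distance matrix between graphs": two well-separated clusters.
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0, 0.5, (15, 2)), rng.normal(5, 0.5, (15, 2))])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
y = np.array([0] * 15 + [1] * 15)
lam, acc = select_lambda(D, y)
assert acc > 0.9  # the two toy classes are trivially separable
```

In practice one would cross-validate $C$ and $\lambda$ jointly with a precomputed-kernel SVM; the nearest-neighbour rule is used here only to keep the sketch dependency-free.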
A.7 Runtime comparison
Overall, we note that WL and WL-OA scale linearly with the number of nodes and will therefore be faster than our approach. Due to differences in the programming-language implementations of the different methods, it is hard to provide an accurate runtime comparison. However, we empirically observe that the Wasserstein graph kernels are still competitive, and a kernel matrix can be computed in a median time of seconds, depending on the size and number of graphs (see Figure A.1). For the continuous attributes, our approach has a runtime in the same order of magnitude as GH. However, while our approach can benefit from a significant speedup (see the discussion below and Section 5.2), GH was empirically shown to scale quadratically with the number of nodes in the graphs [Feragen et al., 2013]. HGK, on the other hand, is considerably slower given the number of iterations and the multiple repetitions needed to account for its randomisation.
To evaluate our approach with respect to the size of the graphs, and recalling that computing the Wasserstein distance has complexity $O(n^3 \log n)$, we simulated a fixed number of graphs with a varying average number of nodes per graph. We generated random node embeddings, where the number of nodes of each graph is drawn from a normal distribution centred around the average number of nodes. We then computed the kernel matrix on each set of graphs to compare the runtime of the regular Wasserstein distance with the Sinkhorn-regularised optimisation. As shown in Supplementary Figure A.1, the speedup starts to become beneficial at around 100 nodes per graph on average, which is larger than the average number of nodes in the benchmark data sets we used.

To ensure that good performance is maintained when using the Sinkhorn approximation, we evaluate the accuracy obtained by the model. Recall that the Sinkhorn method solves the following entropic regularisation problem:
$$\min_{P \in \Gamma} \langle P, M \rangle - \frac{1}{\lambda} H(P),$$
where $H(P) = -\sum_{i,j} P_{ij} \log P_{ij}$ is the entropy of the transport plan; we therefore further need to select the regularisation parameter $\lambda$. On top of the cross-validation scheme described above, we thus further cross-validate over a grid of values of $\lambda$ for the ENZYMES data set and obtain an accuracy that remains above the current state of the art. Across folds, the selected values of $\lambda$ concentrate on a few grid points.
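A textbook Sinkhorn–Knopp iteration for this entropic problem can be sketched in a few lines of numpy (our minimal version, with `eps` playing the role of $1/\lambda$; this is not the implementation used in the experiments):

```python
import numpy as np

def sinkhorn(a, b, M, eps=0.1, n_iter=500):
    """Entropy-regularised OT: minimise <P, M> - eps * H(P) subject to
    row marginals a and column marginals b, via Sinkhorn-Knopp scaling."""
    K = np.exp(-M / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # rescale to match column marginals
        u = a / (K @ v)           # rescale to match row marginals
    P = u[:, None] * K * v[None, :]
    return P, float((P * M).sum())  # plan and (approximate) transport cost

# Uniform marginals over two small point sets on the real line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 2.0])
M = np.abs(x[:, None] - y[None, :])
a = b = np.ones(3) / 3
P, cost = sinkhorn(a, b, M, eps=0.05)
assert np.allclose(P.sum(axis=1), a, atol=1e-6)  # row marginals respected
assert cost < 0.05                               # near-diagonal plan, cost ~ 0
```

Smaller `eps` values approximate the exact Wasserstein distance more closely but require more iterations and eventually become numerically unstable, which is the usual trade-off behind cross-validating the regularisation strength.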