Wasserstein Weisfeiler-Lehman Graph Kernels

06/04/2019 ∙ by Matteo Togninalli, et al. ∙ ETH Zurich 0

Graph kernels are an instance of the class of R-Convolution kernels, which measure the similarity of objects by comparing their substructures. Despite their empirical success, most graph kernels use a naive aggregation of the final set of substructures, usually a sum or average, thereby potentially discarding valuable information about the distribution of individual components. Furthermore, only a limited instance of these approaches can be extended to continuously attributed graphs. We propose a novel method that relies on the Wasserstein distance between the node feature vector distributions of two graphs, which allows to find subtler differences in data sets by considering graphs as high-dimensional objects, rather than simple means. We further propose a Weisfeiler-Lehman inspired embedding scheme for graphs with continuous node attributes and weighted edges, enhance it with the computed Wasserstein distance, and thus improve the state-of-the-art prediction performance on several graph classification tasks.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Graph-structured data have become ubiquitous across domains over the last decades, with examples ranging from social and sensor networks, to chemo- and bioinformatics. Graph kernels (Vishwanathan et al., 2010) have been very successful in dealing with the complexity of graphs and have shown good predictive performance on a variety of classification problems (Shervashidze et al., 2011; Morris et al., 2016; Yanardag and Vishwanathan, 2015). Most graph kernels rely on the -Convolution framework (Haussler, 1999), which decomposes structured objects into substructures to compute local similarities that are then aggregated. While being generally successful for several applications, -Convolution kernels on graphs are not without limitations: (1) the simplicity of the way in which the similarities between substructures are aggregated might limit their ability to capture complex nonlinear structural characteristics of the graph; (2) most proposed variants do not generalise to graphs with high-dimensional continuous node attributes, and extensions are far from being straightforward. Various solutions have been proposed in order to address point (1). For example, Fröhlich et al. (2005) introduced kernels based on optimal assignment of node labels for molecular graphs, even though these kernels are not positive definite (Vert, 2008). More recently, another approach was proposed by Kriege et al. (2016) which set a new state-of-the-art performance in graph classification on categorical node labels. However, this formulation cannot handle continuous nodes attributes, leaving point (2) as an open problem.

To overcome both limitations, we propose a method that combines the most successful vectorial graph-representations derived from the graph kernel literature with ideas from optimal transport theory, which have recently gained considerable attention. In particular, improvements of the computational strategies to efficiently obtain Wasserstein distances (Cuturi, 2013; Altschuler et al., 2017)

have led to many applications in machine learning that use it for various purposes, ranging from generative models 

(Arjovsky et al., 2017)

to new loss functions 

(Frogner et al., 2015). For applications to graphs, notions from optimal transport were used to tackle the graph alignment problem (Xu et al., 2019). In this work, we provide the theoretical foundations of our method, then define a new graph kernel formulation and finally present successful experimental results. More precisely, our main contributions can be summarised as follows:

  • We present the graph Wasserstein distance, a new distance between graphs based on their node feature representations and we discuss how kernels can be derived from it;

  • We introduce a Weisfeiler–Lehman inspired embedding scheme that works for both categorically labelled and continuously attributed graphs and couple it with our graph Wasserstein distance;

  • We establish a new state of the art for graph kernels on traditional graph classification benchmarks with continuous attributes.

2 Background: Graph Kernels and Wasserstein Distance

In this section, we introduce the notation that will be used throughout the manuscript and provide the necessary background on graph kernel methods and on the Wasserstein distance.

2.1 Graph kernels

Kernels are a class of similarity functions that present attractive properties to be used in learning algorithms (Schölkopf and Smola, 2002). Let be a set and be a function associated with a Hilbert space , such that there exists a map with . Then, is a reproducing kernel Hilbert space (RKHS) and

is said to be a positive definite kernel. A positive definite kernel can be interpreted as a dot product in a high-dimensional space, thus permitting its use in any learning algorithm that relies on dot products, such as Support Vector Machines (SVM), by virtue of the

kernel trick (Schölkopf, 2001). Because ensuring positive definiteness is not always feasible, many learning algorithms were recently proposed to extend SVMs to indefinite kernels (Ong et al., 2004; Balcan et al., 2008; Loosli et al., 2015; Oglic and Gärtner, 2018).

We define a graph as a tuple , where and denote the set of nodes and edges, respectively; we further assume that the edges are undirected. Moreover, we denote the cardinality of nodes and edges for as and . For a node , we write and to denote its first-order neighbourhood. We say that a graph is labelled if its nodes have categorical labels. A label on the nodes is a function that assigns to each node in a value from a finite label alphabet . Additionally, we say that a graph is attributed if for each node there exists an associated vector . In the following, we refer to as the node attributes and to as the categorical node labels of node . In particular, the node attributes are high-dimensional continuous vectors, while the categorical node labels are assumed to be integer numbers, encoding either an ordered discrete value or a category. With the term node labels, we will implicitly refer to categorical node labels. Finally, a graph can have weighted edges and the function defines the weight of an edge .

Kernels on graphs are generally defined using the -Convolution framework by Haussler (1999). The main idea is to decompose the graph into substructures and to define a kernel value as a combination of substructure similarities. For instance, the shortest path kernel (Borgwardt and Kriegel, 2005) computes each kernel value as a sum of the similarities between each shortest path in and each shortest path in . Despite the practical success of -Convolution kernels, they often rely on aggregation strategies that ignore valuable information, such as the distribution of the substructures. An example is the Weisfeiler–Lehman (WL) subtree kernel (Shervashidze and Borgwardt, 2009; Shervashidze et al., 2011), which generates graph-level features by summing the contribution of the node representations. In order to avoid these simplifications, we want to use concepts from optimal transport theory such as the Wasserstein distance; they can be used to better capture the similarities between graphs.

2.2 Wasserstein distance

The Wasserstein distance is a distance function between probability distributions defined on a given metric space. Let

and be two probability distributions on a metric space equipped with a ground distance , such as the Euclidean distance.

Definition 1.

The -Wasserstein distance for is defined as


where is the set of all transportation plans over with marginals and on the first and second factors respectively.

The Wasserstein distance satisfies the axioms of a metric, provided that is a metric (see the monograph of Villani (2008), chapter 6, for a proof). Throughout the paper, we will focus on the distance for and we will refer to the -Wasserstein distance when mentioning the Wasserstein distance, unless noted otherwise.

The Wasserstein distance is linked to the optimal transport problem (Villani, 2008), where the aim is to find the most “inexpensive” way, in terms of the ground distance, to transport all the probability mass from the distribution so as to match the distribution . An intuitive illustration can be made for the -dimensional case, where the two probability distributions can be imagined as piles of dirt or sand. The Wasserstein distance, sometimes also referred to as the Earth Mover’s Distance (Rubner et al., 2000), can be interpreted as the minimum effort required to move the content of the first pile to reproduce the second pile.

In this paper, we deal with finite sets of node embeddings and not with continuous probability distributions. We can therefore reformulate the Wasserstein distance as a sum rather than an integral, and use the matrix notation commonly encountered in the optimal transport literature (Rubner et al., 2000) to represent the transportation plan. Given two sets of vectors and , we can equivalently define the Wasserstein distance between them as


Here, is the distance matrix containing the distances between each element of and of , is a transport matrix (or joint probability), and is the Frobenius dot product. The transport matrix contains the fractions that indicate how to transport the values from to with the minimal total transport effort. Since we assume that the total mass to be transported equals and is evenly distributed across the elements of and , the row and column values of must sum up to and , respectively.

3 Wasserstein distance on graphs

The unsatisfactory nature of the aggregation step of current -Convolution graph kernels, which may mask important substructure differences by averaging, motivated us to have a finer distance measure between structures and their components. In parallel, recent advances in optimisation solutions for faster computation of the optimal transport problem inspired us to consider this framework for the problem of graph classification. Our method relies on the following steps: (1) transform each graph into a set of node embeddings, (2) measure the Wasserstein distance between each pair of graphs, and (3) compute a similarity matrix to be used in the learning algorithm. Figure 1 provides an illustration of the first two steps, while Algorithm 1 summarises the whole procedure. We start by defining an embedding scheme and illustrate how we integrate embeddings in the Wasserstein distance.

Definition 2 (Graph Embedding Scheme).

Given a graph , a graph embedding scheme is a function that outputs a fixed-size vectorial representation for each node in the graph. For each , the -th row of is called the node embedding of .

Note that Definition 2 permits treating node labels, which are categorical attributes, as one-dimensional attributes with .

Definition 3 (Graph Wasserstein Distance).

Given two graphs and , a graph embedding scheme and a ground distance , we define the Graph Wasserstein Distance (GWD) as


We will now propose a graph embedding scheme inspired by the WL kernel on categorically labeled graphs, extend it to continuously attributed graphs with weighted edges, and show how to integrate it with the GWD presented in Definition 3.

Figure 1: Visual summary of the Graph Wasserstein Distance. First, generates embeddings for the two input graphs and . Then, the Wasserstein distance between the embedding distributions is computed.

3.1 Generating node embeddings

The Weisfeiler–Lehman scheme. The Weisfeiler–Lehman subtree kernel (Shervashidze and Borgwardt, 2009; Shervashidze et al., 2011), designed for labelled non-attributed graphs, looks at similarities among subtree patterns, defined by a propagation scheme on the graphs that iteratively compares labels on the nodes and their neighbours. This is achieved by creating a sequence of ordered strings through the aggregation of the labels of a node and its neighbours; those strings are subsequently hashed to create updated compressed node labels. With increasing iterations of the algorithm, these labels represent increasingly larger neighbourhoods of each node, allowing to compare more extended substructures.

More precisely, consider a graph , let be the initial node label of for each , and let be the number of WL iterations. Then, we can define a recursive scheme to compute for by looking at the ordered set of neighbours , as


We call this procedure the WL labelling scheme. As in the original publication (Shervashidze and Borgwardt, 2009), we use perfect hashing for the function, so nodes at iteration will have the same label if and only if their label and those of their neighbours are identical at iteration .

Extension to continuous attributes. For graphs with continuous attributes , we need to improve the WL refinement step, whose original definition prohibits handling the continuous case. The key idea is to create an explicit propagation scheme that leverages and updates the current node features by averaging over the neighbourhoods. While similar approaches have been implicitly investigated for computing node-level kernel similarities (Neumann et al., 2016; Morris et al., 2016), they rely on additional hashing steps on the continuous features. Moreover, we can easily account for edge weights by considering them in the average calculation of each neighbourhood. Suppose we have a continuous attribute for each node , then we recursively define


When edge weights are not available, we set . We consider the weighted average of the neighbourhood attribute values instead of a sum and add the factor because we want to ensure a similar scale of the features across iterations; in fact, we concatenate such features for building our proposed kernel (see Definition 4

for more details) and observe better empirical results with similarly-scaled features. While not constituting a test of graph isomorphism, this refinement step can be seen as an intuitive extension for continuous attributes of the one used by the WL subtree kernel on categorical node labels, a widely successful baseline. Moreover, it resembles the propagation scheme used in many graph neural networks, which have proven to be very successful for node classification on large data sets 

(Duvenaud et al., 2015; Kipf and Welling, 2017; Klicpera et al., 2019). Finally, its ability to account for edge weights makes it applicable to all types of graphs without having to perform an hashing step (Morris et al., 2016).

Graph embedding scheme. Using the recursive procedure described above, we propose a WL-based graph embedding scheme that generates node embeddings from the node labels or attributes of the graphs. In the following, we use to denote the dimensionality of the node attributes (thus, for the categorical labels).

Definition 4 (WL Features).

Let and let be the number of WL iterations. Then, for every , we define the WL features as


where for categorically labelled graphs and for continuously attributed graphs. We refer to as the node features of graph at iteration . Then, the node embeddings of graph at iteration are defined as:


We observe that a graph can be both categorically labelled and continuously attributed, and one could extend the above scheme by jointly considering this information, for instance by concatenating the node features. However, we will leave this scenario as an extension for future work, and we thus avoid having to define an appropriate distance measure between categorical and continuous data, as this is a long-standing issue (Stevens, 1946).

3.2 Computing the Wasserstein distance

Once the node embeddings are generated by the graph embedding scheme, we evaluate the pairwise Wasserstein distance between graphs. We start by computing the ground distances between each pair of nodes. For categorical node features, we use the normalised Hamming distance, i.e.:


The Hamming distance can be pictured as the normalised sum of the discrete metric on each of the features. The Hamming distance equals when two vectors have no features in common while it is when the vectors are identical. We use the Hamming distance as, in this case, the Weisfeiler–Lehman features are indeed categorical and values carry no meaning. For continuous node features, on the other hand, we employ the Euclidean distance


We then plug the ground distance into the equation of Definition 1 and compute the Wasserstein distance using a network simplex method (Peyré et al., 2019).

Computational complexity. Naively, the computation of the Wasserstein Distance has a complexity of , with being the cardinality of the indexed set of node embeddings, i.e. the number of nodes in the two graphs. Nevertheless, efficient speedup tricks can be employed. In particular, approximations relying on Sinkhorn regularisation have been proposed (Cuturi, 2013), some of which reduce the computational burden to near-linear time while preserving accuracy (Altschuler et al., 2017). Such speed-up strategies become incredibly useful for larger data sets, i.e. graphs with thousands of nodes, and can be easily integrated in our method. See Appendix A.7 for a practical discussion.

  Input: Two graphs , ; graph embedding scheme ; ground distance ; .
  Output: kernel value .
   // Generate node embeddings
   // Compute the ground distance between each pair of nodes
   // Compute the Wasserstein distance
Algorithm 1 Compute Wasserstein Graph Kernel

4 From Wasserstein distance to kernels

From the graph Wasserstein distance, one can construct a similarity measure to be used in a learning algorithm. In this section, we propose a new graph kernel, state some claims about its (in)definiteness, and elaborate on how to use it for classifying graphs with continuous and categorical node labels.

Definition 5 (Wasserstein Weisfeiler-Lehman).

Given a set of graphs and the GWD defined for each pair of graph on their WL embeddings, we define the Wasserstein Weisfeiler–Lehman (WWL) kernel as:


This is an instance of a Laplacian kernel, which was shown to offer favourable conditions for positive definiteness in case of non-Euclidean distances (Feragen et al., 2015). Obtaining the WWL kernel concludes the procedure described in Algorithm 1. In the remainder of this section, we distinguish between the categorical WWL kernel, obtained on graphs with categorical labels, and the continuous WWL kernel, obtained on continuously attributed graphs via the graph embedding schemes described in Section 3.1.

For Euclidean spaces, obtaining positive definite kernels from distance functions is a well-studied topic (Haasdonk and Bahlmann, 2004). However, the Wasserstein distance in its general form is not isometric, i.e. there is no metric-preserving mapping, to an -norm as the metric space it induces strongly depends on the chosen ground distance (Figalli and Villani, 2011). Therefore, despite being a metric, it is not necessarily possible to derive a positive definite kernel from the Wasserstein distance in its general formulation, because the classical approaches (Haasdonk and Bahlmann, 2004) cannot be applied here. Nevertheless, as a consequence of using the Laplacian kernel (Feragen et al., 2015), we can show that, in the setting of categorical node labels, the obtained kernel is positive definite.

Theorem 1.

The categorical WWL kernel is positive definite for all .

For a proof, see Sections A.1 and A.1.1 in the Appendix. By contrast, for the continuous case, establishing the definiteness of the obtained kernel remains an open problem. We refer the reader to Section A.1.2 in the supplementary materials for further discussions and conjectures.

To ensure the theoretical and practical correctness of our results, we employ recently-developed methods for learning with indefinite kernels. More precisely, we utilise learning methods for Kreĭn spaces, which have been specifically designed to work with indefinite kernels (Ong et al., 2004); in general, kernels that are not positive definite induce Reproducible Kernel Kreĭn Spaces (RKKS). These spaces can be seen as a generalisation of Reproducible Kernel Hilbert Spaces (RKHS), with which they share similar mathematical properties, making them amenable to machine learning techniques. Recent algorithms (Loosli et al., 2015; Oglic and Gärtner, 2018) are capable of solving learning problems in RKKS; their results indicate that there are clear benefits (in terms of classification performance, for example) of learning in such spaces. Therefore, when evaluating WWL, we will use a Kreĭn SVM (KSVM, (Loosli et al., 2015)) as a classifier for the case of continuous attributes.

Table 1: Classification accuracies on graph with categorical node labels. Comparison of Weisfeiler–Lehman Kernel (WL); Optimal Assignment Kernel (WL-OA); and our method WWL.
Table 2: Classification accuracies on graphs with continuous node and/or edge attributes. Comparison of Hash-Graphs Kernel (HGK-WL, HGK-SP); Graph Hopper Kernel (GH); and our method WWL.

5 Experimental evaluation

In this section, we analyse how the performance of WWL compares with state-of-the-art graph kernels. In particular, we empirically observe that WWL (1) is competitive with the best graph kernel for categorically labelled data, and (2) outperforms all the state-of-the-art graph kernels for attributed graphs.

5.1 Data sets

We report results on real-world data sets from multiple sources (Borgwardt et al., 2005; Vishwanathan et al., 2010; Shervashidze et al., 2011) and use either their continuous attributes or categorical labels for evaluation. In particular, MUTAG, PTC-MR, NCI1 and D&D are equipped with categorical node labels only; ENZYMES, PROTEINS have both categorical labels and continuous attributes; IMDB-B, BZR and COX2 only contain continuous attributes; finally, BZR-MD and COX2-MD have both continuous node attributes and edge weights. Further information on the data sets are available in Supplementary Table A.1. Additionally, we also report results on synthetic data (Synthie and SYNTHETIC-new) in Appendix A.5. All the data sets have been downloaded from Kersting et al. (2016).

5.2 Experimental setup

We compare WWL with state-of-the-art graph kernel methods from the literature as well as relevant baselines, which we trained ourselves on the same splits (see below). In particular, for the categorical case we compare with WL (Shervashidze and Borgwardt, 2009) and WL-OA (Kriege et al., 2016) as well as with the vertex (V) and edge (E) histograms. For the continuously attributed data sets, we compare with two instances of the Hash Graph Kernel (HGK-SP; HGK-WL) (Morris et al., 2016), and with the Graph Hopper (GH) (Feragen et al., 2013). As a further comparison partner, we use a continuous vertex histogram (VH-C), which is defined as an RBF kernel between the sum of the graph node embeddings. Furthermore, to highlight the benefits of using the Wasserstein distance in our method, we replace it with an RBF kernel. More precisely, given two graphs and , with and , we first compute the Gaussian kernel between each pair of the node embeddings obtained in the same fashion as for WWL, therefore obtaining a kernel matrix between node embeddings ; then, we sum up the values = and set . This procedure is repeated for each pair of graphs to obtain the final graph kernel matrix. We refer to this baseline as RBF-WL.

As a classifier, we use an SVM (or a KSVM in the case of WWL) and 10-fold cross-validation, selecting the parameters on the training set only. We repeat each cross-validation split 10 times and report the average accuracy. We employ the same split for each evaluated method, thus guaranteeing a fully comparable setup among all evaluated methods. Please refer to Appendix A.6

for details on the hyperparameter selection.


Available Python implementations can be used to compute the WL kernel (Sugiyama et al., 2018) and the Wasserstein distance (Flamary and Courty, 2017). We leverage these resources to develop our method and make our code publicly available. We use the original implementations provided by the respective authors to compute the WL-OA, the HGK, and the GH methods.

5.3 Results and discussion

The results are evaluated via classification accuracy and summarised in Table 1 and Table 2111

The best performing methods up to the resolution implied by the standard deviation across repetitions are highlighted in boldface. Additionally, to evaluate significance we perform 2-sample t-tests with a 0.05 threshold and Bonferroni correction for multiple hypothesis testing within each data set, significantly outperforming methods are denoted by an asterisk.

for the categorical labels and continuous attributes, respectively.

5.3.1 Categorical labels

On the categorical data sets, WWL is comparable to the WL-OA kernel; it improves over the classical WL, though. In particular, WWL largely improves over WL-OA in PTC-MR and is slightly better on D&D, while WL-OA is better on NCI1 and PROTEINS.

Unsurprisingly, our approach is comparable to the WL-OA, whose main idea is to solve the optimal assignment problem by defining Dirac kernels on histograms of node labels, using multiple iterations of WL. This formulation is similar to the one we provide for categorical data, but relies on optimal assignment rather than optimal transport and therefore requires one-to-one mappings instead of continuous transport maps. Beside, we solve the optimal transport problem on the concatenated embeddings, hereby jointly exploiting representations at multiple WL iterations. Contrarily, the WL-OA performs optimal assignment at each iteration of WL and only combines them in a second stage. However, the key advantage of WWL over WL-OA is its capacity to account for continuous attributes.

5.3.2 Continuous attributes

In this setting, WWL significantly outperforms the other methods on out of data sets, is better on another one, and is on a par on the remaining . We further compute the average rank of each method in the continuous setting, with WWL scoring as first. The ranks calculated from Table 2 are: WWL = ; HGK-WL = ; RBF-WL = ; HGK-SP = ; VH-C = . This is a remarkable improvement over the current state of the art, and it indeed establishes a new one. When looking at the average rank of the method, WWL always scores first. We thus raise the bar in kernel graph classification for attributed graphs. As mentioned in Section 4, the kernel obtained from continuous attributes is not necessarily positive definite. However, we empirically observe the kernel matrices to be positive definite (up to a numerical error), further supporting our theoretical considerations (see Appendix A.1). In practice, the difference between the results obtained from classical SVMs in RKHS and the results obtained with the KSVM approach are negligible.

Comparison with Hash Graph Kernels.

The Hash Graph Kernel (HGK) approach is somewhat related to our propagation scheme. By using multiple hashing functions, the HGK method is capable to extend certain existing graph kernels to the continuous setting. This helps avoid the limitations of perfect hashing, which cannot express small differences in continuous attributes. A drawback of the random hashing performed by HGK is that it requires additional parameters and introduces a stochastic element to the kernel matrix computation. By contrast, our propagation scheme is fully continuous and uses the Wasserstein distance to capture small differences in distributions of continuous node attributes. Moreover, the observed performance gap suggests that an entirely continuous representation of the graphs provides clear benefits over the hashing.

6 Conclusion

In this paper, we present a new family of graph kernels, the Wasserstein Weisfeiler–Lehman (WWL) graph kernels. Our experiments show that WWL graph kernels constitute the new state of the art for graph classification in the scenario of continuous node attributes, while matching the state of the art in the categorical setting. As a line of research for future work, we see a great potential in the runtime improvement, thus allowing applications of our method on regimes and data sets with larger graphs. In fact, preliminary experiments (see Section A.7 as well as Figure A.1 in the Appendix) already confirm the benefit of Sinkhorn regularisation, when the average number of nodes in the graph increases. In parallel, it would be beneficial to derive approximations of the explicit feature representations in the RKKS, as this would also provide a consistent speedup. We further envision that major theoretical contributions could be made by defining theoretical bounds to ensure the positive definiteness of the WWL kernel in the case of continuous node attributes. Finally, optimisation objectives based on optimal transport could be employed to develop new algorithms based on graph neural networks (Duvenaud et al., 2015; Kipf and Welling, 2017). On a more general level, our proposed method provides a solid foundation and highlight the large potential of optimal transport theory for machine learning.


  • Altschuler et al. [2017] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1964–1974, 2017.
  • Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • Balcan et al. [2008] M.-F. Balcan, A. Blum, and N. Srebro. A theory of learning with similarity functions. Machine Learning, 72(1-2):89–112, 2008.
  • Berg et al. [1984] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. Springer, 1984.
  • Borgwardt and Kriegel [2005] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Fifth IEEE International Conference on Data Mining, page 8. IEEE, 2005.
  • Borgwardt et al. [2005] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21:i47–i56, 2005.
  • Bridson and Häfliger [2013] M. R. Bridson and A. Häfliger. Metric spaces of non-positive curvature. Springer, Heidelberg, Germany, 2013.
  • Cuturi [2013] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
  • Duvenaud et al. [2015] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
  • Feragen et al. [2013] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. Borgwardt. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems, pages 216–224, 2013.
  • Feragen et al. [2015] A. Feragen, F. Lauze, and S. Hauberg. Geodesic exponential kernels: When curvature and linearity conflict. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 3032–3042, 2015.
  • Figalli and Villani [2011] A. Figalli and C. Villani. Optimal transport and curvature. In Nonlinear PDE’s and Applications, pages 171–217. Springer, 2011.
  • Flamary and Courty [2017] R. Flamary and N. Courty. Pot python optimal transport library, 2017. URL https://github.com/rflamary/POT.
  • Frogner et al. [2015] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
  • Fröhlich et al. [2005] H. Fröhlich, J. K. Wegner, F. Sieker, and A. Zell. Optimal assignment kernels for attributed molecular graphs. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 225–232, New York, NY, USA, 2005. ACM.
  • Gardner et al. [2017] A. Gardner, C. A. Duncan, J. Kanno, and R. R. Selmic. On the definiteness of earth mover’s distance and its relation to set intersection. IEEE Transactions on Cybernetics, 2017.
  • Haasdonk and Bahlmann [2004] B. Haasdonk and C. Bahlmann. Learning with distance substitution kernels. In Joint Pattern Recognition Symposium, pages 220–227. Springer, 2004.
  • Haussler [1999] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California, 1999.
  • Kersting et al. [2016] K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
  • Kipf and Welling [2017] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • Klicpera et al. [2019] J. Klicpera, A. Bojchevski, and S. Günnemann. Combining neural networks with personalized pagerank for classification on graphs. In International Conference on Learning Representations (ICLR), 2019.
  • Kolouri et al. [2016] S. Kolouri, Y. Zou, and G. K. Rohde. Sliced wasserstein kernels for probability distributions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5258–5267, 2016.
  • Kriege and Mutzel [2012] N. Kriege and P. Mutzel. Subgraph matching kernels for attributed graphs. arXiv preprint arXiv:1206.6483, 2012.
  • Kriege et al. [2016] N. M. Kriege, P.-L. Giscard, and R. C. Wilson. On valid optimal assignment kernels and applications to graph classification. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1623–1631, 2016.
  • Loosli et al. [2015] G. Loosli, S. Canu, and C. S. Ong. Learning svm in kreĭn spaces. IEEE transactions on pattern analysis and machine intelligence, 38(6):1204–1216, 2015.
  • Morris et al. [2016] C. Morris, N. M. Kriege, K. Kersting, and P. Mutzel. Faster kernels for graphs with continuous attributes via hashing. In IEEE 16th International Conference on Data Mining (ICDM), pages 1095–1100, 2016.
  • Neumann et al. [2016] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.
  • Oglic and Gärtner [2018] D. Oglic and T. Gärtner. Learning in reproducing kernel kreın spaces. In Proceedings of the thirty-fifth International Conference on Machine Learning, 2018.
  • Ong et al. [2004] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels. In Proceedings of the twenty-first International Conference on Machine Learning. ACM, 2004.
  • Peyré et al. [2019] G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • Rabin et al. [2011] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
  • Rubner et al. [2000] Y. Rubner, C. Tomasi, and L. J. Guibas.

    The earth mover’s distance as a metric for image retrieval.

    International journal of computer vision, 40(2):99–121, 2000.
  • Schölkopf [2001] B. Schölkopf. The kernel trick for distances. In Advances in neural information processing systems, pages 301–307, 2001.
  • Schölkopf and Smola [2002] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • Shervashidze and Borgwardt [2009] N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. In Advances in neural information processing systems, 2009.
  • Shervashidze et al. [2011] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
  • Shin-Ichi [2012] O. Shin-Ichi. Barycenters in alexandrov spaces of curvature bounded below. In Advances in Geometry, volume 14. De Gruyter, 2012.
  • Stevens [1946] S. S. Stevens. On the theory of scales of measurement. Science, 1946.
  • Sugiyama et al. [2018] M. Sugiyama, M. E. Ghisu, F. Llinares-López, and K. Borgwardt. graphkernels: R and python packages for graph comparison. Bioinformatics, 34(3):530–532, 2018.
  • Turner et al. [2014] K. Turner, Y. Mileyko, S. Mukherjee, and J. Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52:44–70, Jul 2014.
  • Vert [2008] J.-P. Vert. The optimal assignment kernel is not positive definite. arXiv preprint arXiv:0801.4061, 2008.
  • Villani [2008] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Vishwanathan et al. [2010] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.
  • Xu et al. [2019] H. Xu, D. Luo, H. Zha, and L. Carin. Gromov-wasserstein learning for graph matching and node embedding. arXiv preprint arXiv:1901.06003, 2019.
  • Yanardag and Vishwanathan [2015] P. Yanardag and S. Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.

Appendix A Appendix

a.1 Extended considerations on WWL definiteness

We will now discuss on the positive definite nature of our WWL kernel.

In general, whether distances obtained from optimal transport problems can be used to create positive definite kernels remains an open research question. Several attempts to draw general conclusions on the definiteness of the Wasserstein distance were unsuccessful, but insightful results on particular cases were obtained along the way. Here, we first collect some of these contributions and use them to prove that our WWL Kernel for categorical embeddings is positive definite. We then elaborate further on the continuous embeddings case, for which we provide conjectures on practical conditions to obtain a pd kernel.

Before proceeding, let us remind some useful notions.

Definition 6.

Schölkopf and Smola [2002] A symmetric function is called a positive definite (pd) kernel if it satisfies the condition


for every , and .

The matrix of kernel values with entries is called the Gram matrix of with respect to . A conditional positive definite (cpd) kernel is a function that satisfies Equation 11 for all with , whereas a conditional negative definite kernel is a function that satisfies for all with .

For Euclidean spaces, obtaining kernels from distance functions is a well-studied topic.

Proposition 1.

[Haasdonk and Bahlmann, 2004] Let be a symmetric, non-negative distance function with . If is isometric to an -norm, then


is a valid cpd kernel.

However, the Wasserstein distance in its general form is not isometric to an -norm as the metric space it induces strongly depends on the chosen ground distance [Figalli and Villani, 2011]. More recently, Feragen et al. [2015] argued that many types of data, including probability distributions, do not always reside in Euclidean spaces. Therefore, they define the family of exponential kernels relying on a non-Euclidean distance , as follows


and, based on earlier considerations from Berg et al. [1984], show that, under certain conditions, the Laplacian kernel ( in Equation 13) is positive definite.

Proposition 2.

[Feragen et al., 2015] The geodesic Laplacian kernel is positive definite for all if and only if the geodesic distance is conditional negative definite.

Once again, considerations on the negative definiteness of Wasserstein distance functions cannot be made on the general level. Certain ground distances, however, guarantee the negative definiteness of the resulting Wasserstein distance. In particular, the Wasserstein distance with the discrete metric (i.e.  in Equation 8) as the ground distance, was proved to be conditional negative definite [Gardner et al., 2017].

We will now leverage these results to prove that the Wasserstein distance equipped with the Hamming ground distance is conditional negative definite, therefore it yields positive definite kernels for the categorical WL embeddings.

a.1.1 The case of categorical embeddings

When generating node embeddings using the Weisfeiler–Lehman labelling scheme with a shared dictionary across all the graphs, the solutions to the optimal transport problem are also shared across iterations. We denote the Weisfeiler–Lehman embedding scheme as defined in Definition 4 as , and let be the corresponding GWD on a set of graphs with categorical labels. Let of Equation 8 be the ground distance of . Then, the following useful results hold.

Lemma 1.

If a transportation plan with transport matrix is optimal in the sense of Definition 1 for distances between embeddings obtained with , then it is also optimal for the discrete distances between the -th iteration values obtained with the Weisfeiler–Lehman procedure.

Proof. See Appendix A.2

Lemma 2.

If a transportation plan with transport matrix is optimal in the sense of Definition 1 for distances between embeddings obtained with , then it is also optimal for distances between embeddings obtained with .

Proof. See Appendix A.3

We therefore postulate that the Wasserstein distance between categorical WL node embeddings is a conditional negative definite function.

Theorem 2.

is a conditional negative definite function.

Proof. See Appendix A.4

Proof of Theorem 1. Theorem 2 in light of Proposition 2 implies that the WWL kernel of Definition 5 is positive definite for all .

We will now consider the case of the definiteness of kernels in the continuous setting.

a.1.2 The case of continuous embeddings

On one hand, in the categorical case, we proved the positive definiteness of our kernel. On the other hand, the continuous case is considerably harder to tackle. We conjecture that, under certain conditions, the same might hold for continuous features. While we do not have a formal proof yet, in what follows, we discuss arguments to support this conjecture which seems to agree with our empirical findings.222We observe that, for all considered data sets and after standardisation of the input features prior to the embedding scheme, GWD matrices are conditional negative definite.

The curvature of the metric space induced by the Wasserstein metric for a given ground distance plays an important role here. We first need to define Alexandrov spaces.

Definition 7 (Alexandrov space).

Given a metric space and a real number , the space is called an Alexandrov space if its sectional curvature is .

Roughly speaking, the curvature indicates to what extent a geodesic triangle will be deformed in the space. The case of is special as no distortion is happening here—hence, spaces that satisfy this property are called flat. The concept of Alexandrov spaces is required in the following proposition, taken from a theorem by Feragen et al. [2015], which shows the relationship between a kernel and its underlying metric space.

Proposition 3.

The geodesic Gaussian kernel (i.e. in Equation 13) is positive definite for all if and only if the underlying metric space is flat in the sense of Alexandrov, i.e. if any geodesic triangle in can be isometrically embedded in a Euclidean space.

However, it is unlikely that the space induced by the Wasserstein distance is locally flat, as not even the geodesics (i.e. a generalization of a shortest path to arbitrary metric spaces) between graph embeddings are necessarily unique, as we shall show below. Hence, we use the geodesic Laplacian kernel instead of the Gaussian one because it poses less strict requirements on the induced space, as stated in Proposition 2. More precisely, the metric used in the kernel function needs to be cnd. We cannot directly prove this yet, but we can provide a proof that the converse is not true. To this end, we first notice that the metric space induced by the GWD, which we shall here refer to as , does not have a curvature that is bounded from above.

Definition 8.

A metric space is said to be if its curvature is bounded by some real number from above. This can also be seen a “relaxed” definition, or generalisation, of a Riemannian manifold.

Theorem 3.

is not in for any , meaning that its curvature is not bounded by any from above.


This follows from a similar argument as presented by Turner et al. [2014]. We briefly sketch the argument. Let and be two graphs. Assume that is a space for some . Then it follows [Bridson and Häfliger, 2013, Proposition 2.11, p. 23] that if , there is a unique geodesic between them. However, we can construct a family of graph embeddings for which this is not the case. To this end, let and and be two graph embeddings with node embeddings , as well as and , respectively. Since we use the Euclidean distance as a ground distance, there will be two optimal transport plans: the first maps to and to , while the second maps to and to . Hence, we have found two geodesics that connect and . Since we may choose to be arbitrarily small, the space cannot be for . ∎

While this means that we do not have a simple upper bound on the curvature, we have the following conjecture.

Conjecture 1.

is an Alexandrov space with curvature bounded from below by zero.

For a proof idea, we refer to Turner et al. [2014]; the main argument involves characterizing the distance between triples of graph embeddings. This conjecture is helpful insofar as being a non-negatively-curved Alexandrov space is a necessary prerequisite for to be a Hilbert space [Shin-Ichi, 2012]. In turn, Feragen et al. [2015] shows that cnd metrics and Hilbert spaces are intricately linked. We thus have some hope in obtaining a cnd metric, even though we lack a proof as of now. Our empirical results, however, indicate that it is possible to turn the GWD into a cnd metric, provided a proper normalisation takes place. Intuitively, for high dimensional input spaces, standardisation of input features changes the curvature of the induced space by making it locally (almost) flat.

To support this argumentation, we refer to an existing way to ensure positive definiteness. One can use an alternative to the classical Wasserstein distance denoted as the sliced Wasserstein [Rabin et al., 2011]. The idea is to project high dimensional ditributions into one-dimensional spaces, hereby calculating the Wasserstein distance as a combination of one dimensional representations. Kolouri et al. [2016] showed that each of the one-dimensional Wasserstein distances are conditional negative definite. The kernel on high-dimensional representations is then defined as a combination of the one-dimensional positive definite counterparts.

a.2 Proof of Lemma 1


We recall the matrix notation introduced in Equation 2 of the main paper, where is the cost or distance matrix, is a transport matrix (or joint probability), and is the Frobenius dot product. Since we give equal weight (i.e. equal probability mass) to each of the vectors in each set, contains all nonnegative matrices with

For notation simplicity, let us denote by the Hamming matrix , where the -th entry is given by the Hamming distance between the embedding of the -th node of graph and the embedding of the -th node of graph at iteration . Similarly, we define to be the discrete metric distance matrix, where the -th entry is given by the discrete distance between feature of node embedding of graph and feature of node embedding of graph . It is easy to see that while and, by definition,

Moreover, due to the WL procedure, two labels that are different at iteration will also be different at iteration . Hence, the following identity holds

Which implies that . An optimal transportation plan for embeddings satisfies

Assuming that is not optimal for , we can define such that

Since the entries of are either or , we can define the set of indices tuples and rewrite the inequality as

Due to the constraints on the entries of and , namely , this implies that, by rearranging the transport map, there is more mass that could be transported at cost. In our formalism:

However, as stated before, entries of that are are also in and a better transport plan would therefore also be optimal for :

Which contradicts the optimality assumption above. Hence, is also optimal for .

a.3 Proof of Lemma 2


Intuitively, the transportation plan at iteration is a “refinement” of the transportation plan at iteration , where only a subset of the optimal transportation plans remain optimal for the new cost matrix . Using the same notation as for the Proof in Appendix A.2, and due to the WL procedure, two labels that are different at iteration will also be different at iteration . Hence, the following identities hold

An optimal transportation plan for embeddings satisfies

Which can also be written as

The values of increase in a step-wise fashion for increasing and their ordering remains constant except for entries that were at iteration and become at iteration . Hence, since our metric distance matrices satisfy monotonicity conditions and because is optimal for according to Lemma 1, it follows that

Hence, is also optimal for embeddings. ∎

a.4 Proof of Theorem 2


Using the same notation as for the Proof in Appendix A.2 and the formulation in Equation 2, we can write

Let be an optimal solution for iteration then, from Lemmas 1 and 2, it is also an optimal solution for as well as for all and we can rewrite the equation as a sum of optimal transport problems


This corresponds to a sum of 1-dimensional optimal transport problems relying on the discrete metric, which were shown to be conditional negative functions [Gardner et al., 2017]. The final sum is therefore conditional negative definite as well. ∎

a.5 Data sets and additional results

We report additional information on the data sets used in our experimental comparison in Supplementary Table A.1

Data set Class Node Node Edge Graphs Classes
Ratio Labels Attributes Weights
NCI1 - -
PTC-MR - -
DD - -
ENZYMES per class -
COX2 -
SYNTHIE per class - -
BzR-MD -
Table A.1: Table reporting information on the various data sets

Our data sets belong to multiple chemoinformatics domains, including small molecules (MUTAG, PTC-MR, NCI1), macromolecules (ENZYMES, PROTEINS, D&D) and chemical compounds (BZR, COX2). We further consider a movie collaboration dataset (IMDB, see [Yanardag and Vishwanathan, 2015] for a desription) and two synthetic data sets Synthie and Synthetic-new, created by Morris et al. [2016] and Feragen et al. [2013], respectively. The BZR-MD and COX2-MD data sets do not have node attributes but contain the atomic distance between each connected atom as an edge weight. We do not consider distances between non-connected nodes [Kriege and Mutzel, 2012]

and we equip the node with one-hot-encoding categorical attribtues representing the atom type, i.e. what is originally intended as a categorical node label. On

IMDB-B, IMDB-BINARY was used with the node degree as a (semi-)continuous feature for each node [Yanardag and Vishwanathan, 2015]. For all the other data sets, we use the off-the-shelf version provided by Kersting et al. [2016].

Results on synthetic data sets are provided in Table A.2. We decided not to include those in the main manuscript because of the very unstable and unreliable performances we obtained. In particular, for both data sets there is a very high variation among the different methods. Furthermore, we experimentally observed that even a slight modification of the node features (e.g. normalisation or scaling of the embedding scheme) resulted in large change in performances (up to ). Additionally, it has been previously reported [Morris et al., 2016, Feragen et al., 2013] that on Synthetic-new, a WL with degree treated as categorical node label outperforms the competitors, suggesting that the continuous attribtues are indeed not very informative. Therefore, we excluded these data sets from the main manuscript, as we concluded that they are unsuitable to provide a fair assessment of methods quality.

Table A.2: Classification accuracies on synthetic graphs with continuous node attributes. Comparison of Hash-Graphs Kernel (HGK-WL, HGK-SP); Graph Hopper Kernel (GH); and our method WWL.

a.6 Details on hyperparameter selection

The following ranges are used for the hyperparameter selection: the parameter of the SVM (for continuous attributes) and (for categorical attributes); the WL number of iterations ; the parameter of the WWL . For the RBF-WL and VH-C we use default parameter for the Gaussian kernel, i.e , where is the size of node attributes. For the GH kernel, we also fix the parameter to , and for HGK we fix the number of iterations to for each data set except for SYNTHETICnew, where we use (these setups were suggested by the respective authors [Morris et al., 2016, Feragen et al., 2013]. Furthermore, since HGK is a randomised method, we compute each kernel matrix times and average the results. When the dimensionality of the continuous attributes , these are normalised to ensure comparability among the different feature scales, in each data set but BZR and COX2, due to the meaning of the node attributes being location coordinates.

a.7 Runtime comparison

Overall, we note that WL and WL-OA scale linearly with the number of nodes, and will therefore be faster than our approach. Due to the differences in programming language implementations of the different methods, it is hard to provide an accurate runtime comparison. However, we empirically observe that the Wasserstein Graph Kernels are still competitive and a kernel matrix can be computed in a median time of s, depending on the size and number of graphs (see Figure A.1). For the continuous attributes, our approach has a runtime in the same order of magnitude as the GH. However, while our approach can benefit from significant speed-up (see discussion below and Section 5.2), the GH was shown to empirically scale quadratically with the number of nodes in the graphs [Feragen et al., 2013]. The HGK, on the other hand, is considerably slower given the number of iterations and multiple repetitions needed while taking into account the randomisation.

To evaluate our approach with respect to the size of the graphs and recalling that computing the Wasserstein distance has complexity , we simulated a fixed number of graphs with a varying average number of nodes per graph. We generated random node embeddings for

where the number of nodes is taken from a normal distribution centered around the average number of nodes. We then computed the kernel matrix on each set of graphs to compare the runtime of regular Wasserstein with the Sinkhorn regularized optimization. As shown in Supplementary Figure 

A.1, the speedup starts to become beneficial at around 100 nodes per graph on average, which is larger than the average number of nodes in the benchmark data sets we used.

In order to ensure that good performance is maintained when using the Sinkhorn approximation, we evaluate the obtained accuracy of the model. Recalling that the Sinkhorn method solves the following entropic regularization problem:

we further need to select . Therefore, on top of the cross-validation scheme described above, we further cross-validate over the regularization parameter values of for the Enzymes data set and obtain an accuracy of , which remains above the current state-of-the-art. The ’s selected most of the time are , and .

Figure A.1: Runtime performance of the WWL Kernel computation step with a fixed number of graphs. We also report the time taken to compute the ground distance matrix as distance_time. total_time is the sum of the time to compute the ground distance and the time taken to solve the optimal transport (ot) problem for the regular solver or the sinkhorn-regularized one. The logarithmic scale on the right side figure shows how for a small average number of nodes, the overhead to run sinkhorn is higher than the benefits.