Scalable Global Alignment Graph Kernel Using Random Features: From Node Embedding to Graph Embedding

November 25, 2019 · Lingfei Wu et al.

Graph kernels are widely used for measuring the similarity between graphs. Many existing graph kernels, which focus on local patterns within graphs rather than their global properties, suffer from significant structure information loss when representing graphs. Some recent global graph kernels, which utilize the alignment of geometric node embeddings of graphs, yield state-of-the-art performance. However, these graph kernels are not necessarily positive-definite. More importantly, computing the graph kernel matrix has at least quadratic time complexity in terms of the number and the size of the graphs. In this paper, we propose a new family of global alignment graph kernels, which take into account the global properties of graphs by using geometric node embeddings and an associated node transportation based on earth mover's distance. Compared to existing global kernels, the proposed kernel is positive-definite. Our graph kernel is obtained by defining a distribution over random graphs, which can naturally yield random feature approximations. These random feature approximations lead to our graph embeddings, which we name "random graph embeddings" (RGE). In particular, RGE is shown to achieve (quasi-)linear scalability with respect to the number and the size of the graphs. Experimental results on nine benchmark datasets demonstrate that RGE outperforms or matches twelve state-of-the-art graph classification algorithms.


1. Introduction

Graph kernels are one of the most important methods for graph data analysis and have been successfully applied in diverse fields such as disease and brain analysis (Mokhtari and Hossein-Zadeh, 2013; Chen and Wu, 2017), chemical analysis (Ralaivola et al., 2005), image action recognition and scene modeling (Wang and Sahbi, 2013; Fisher et al., 2011), and malware analysis (Wagner et al., 2009). Since there are no explicit features in graphs, a kernel function corresponding to a high-dimensional feature space provides a flexible way to represent each graph and to compute similarities between graphs. Hence, much effort has been devoted to designing feature spaces or kernel functions for capturing similarities between the structural properties of graphs.

The first line of research focuses on local patterns within graphs (Gärtner et al., 2003; Shervashidze and Borgwardt, 2009). Specifically, these kernels recursively decompose the graphs into small sub-structures, and then define a feature map over these sub-structures for the resulting graph kernel. Conceptually, these notable graph kernels can be viewed as instances of a general kernel-learning framework called R-convolution for discrete objects (Haussler, 1999; Shervashidze et al., 2011). However, the aforementioned approaches consider only local patterns rather than global properties, which may substantially limit effectiveness in some applications, depending on the underlying structure of graphs. Equally importantly, most of these graph kernels scale poorly to large graphs due to their at least quadratic time complexity in terms of the number of graphs and cubic time complexity in terms of the size of graphs.

Another family of methods uses geometric embeddings of graph nodes to capture global properties; these methods have shown great promise, achieving state-of-the-art performance in graph classification (Johansson et al., 2014; Johansson and Dubhashi, 2015; Nikolentzos et al., 2017). However, these global graph kernels, which are based on matching node embeddings between graphs, may lose positive definiteness. Furthermore, the majority of these approaches have at least quadratic complexity in terms of either the number of graph samples or the size of the graphs.

To address these limitations of existing graph kernels, we propose a new family of global graph kernels that take into account the global properties of graphs, based on recent advances in the distance kernel learning framework (Wu et al., 2018b). The proposed kernels are truly positive-definite (p.d.) kernels constructed from a random feature map given by a transportation distance between a set of geometric node embeddings of raw graphs and those of random graphs sampled from a distribution. In particular, we make full use of the well-known Earth Mover's Distance (EMD), computing the minimum cost of transporting the set of node embeddings of a raw graph to that of a random graph. To yield an efficient computation of the kernel, we derive a Random Features (RF) approximation using a limited number of random graphs drawn from either data-independent or data-dependent distributions. The methods used to generate high-quality random graphs have a significant impact on graph learning, and we propose two different sampling strategies depending on whether or not node label information is used. Furthermore, we note that each building block in this paper - geometric node embeddings and EMD - can be replaced by other node embedding methods (Johansson and Dubhashi, 2015; Zhang et al., 2018) and transportation distances (Solomon et al., 2016). Our code is available at https://github.com/IBM/RandomGraphEmbeddings.

We highlight the main contributions as follows:

  • We propose a class of p.d. global alignment graph kernels that capture the global properties of graphs through geometric node embeddings and the corresponding node transportation.

  • We present Random Graph Embeddings (RGE), a by-product of the RF approximation, which yields an expressive graph embedding. Based on this graph embedding, we significantly reduce computational complexity at least from quadratic to (quasi-)linear in both the number and the size of the graphs.

  • We theoretically show the uniform convergence of RGE: given R random graphs, the inner product of RGE uniformly approximates the corresponding exact graph kernel within ε precision, with high probability.

  • Our experimental results on nine benchmark datasets demonstrate that RGE outperforms or matches twelve state-of-the-art graph classification algorithms including graph kernels and deep graph neural networks. In addition, we numerically show that RGE can achieve (quasi-)linear scalability with respect to both the number and the size of graphs.

2. Related Work

In this section, we first briefly survey existing graph kernels and then detail the difference between the conventional random features method for vector inputs (Rahimi and Recht, 2008) and our random features method for structured inputs.

2.1. Graph Kernels

Generally speaking, we can categorize the existing graph kernels into two groups: kernels based on local sub-structures, and kernels based on global properties.

The first group of graph kernels compares sub-structures of graphs, following a general kernel-learning framework, i.e., R-convolution for discrete objects (Haussler, 1999). The major difference among these graph kernels lies in how they define and enumerate the sub-structures used to build the kernel, including random walks (Gärtner et al., 2003), shortest paths (Borgwardt and Kriegel, 2005), cycles (Horváth et al., 2004), subtree patterns (Shervashidze and Borgwardt, 2009), and graphlets (Shervashidze et al., 2009). A thread of research attempts to utilize node label information using the Weisfeiler-Leman (WL) test of isomorphism (Shervashidze et al., 2011) and to take structural similarity between sub-structures into account (Yanardag and Vishwanathan, 2015b, a) to further improve kernel performance.

Recently, a new class of graph kernels, which focus on the use of geometric node embeddings of graphs to capture global properties, has been proposed. These kernels have achieved state-of-the-art performance in graph classification (Johansson et al., 2014; Johansson and Dubhashi, 2015; Nikolentzos et al., 2017). The first global kernels were based on the Lovász number (Lovász, 1979) and its associated orthonormal representation (Johansson et al., 2014); however, these kernels can only be applied to unlabelled graphs. Later approaches directly learn graph embeddings using landmarks (Johansson and Dubhashi, 2015) or compute a similarity matrix (Nikolentzos et al., 2017) by exploiting different matching schemes between the geometric node embeddings of a pair of graphs. Unfortunately, the resulting kernel matrix does not yield a valid p.d. kernel, which precludes the direct use of kernel support vector machines. Two recent graph kernels, the multiscale Laplacian kernel (Kondor and Pan, 2016) and the optimal assignment kernel (Kriege et al., 2016), were developed to overcome these limitations by building a p.d. kernel between node distributions or via histogram intersection.

However, most existing kernels focus only on learning a kernel matrix for graphs instead of a graph-level representation, which restricts them to graph classification and rules out other graph-related tasks (e.g., graph matching). More importantly, how the nodes of two graphs are aligned plays a central role in learning a similarity score. In this paper, we rely on an optimal transportation distance (e.g., the Earth Mover's Distance) to learn the alignment between corresponding nodes that have similar structural roles in their graphs, and we directly generate a graph-level representation (embedding) for each graph instead of explicitly computing a kernel matrix.

2.2. Random Features for Kernel Machines

Over the last decade, the most popular approach for scaling up kernel methods has arguably been random features (RF) approximation and its fruitful variants (Rahimi and Recht, 2008; Sinha and Duchi, 2016; Bach, 2017; Wu et al., 2018a). Given a predefined kernel function, the inner product of RF directly approximates the exact kernel via sampling from a distribution, which leads to a fast linear method for computing the kernel based on the learned low-dimensional feature representation. However, these RF approximation methods can only be applied to shift-invariant kernels (e.g., the Gaussian or Laplacian kernels) with vector-form input data. Since a graph is a complex object, existing graph kernels are neither shift-invariant nor defined on vector-form inputs. Due to these challenges, to the best of our knowledge, there are no existing studies on how to develop an RF approximation for graph kernels.
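For concreteness, the following sketch (a standard textbook construction, not part of this paper's method) shows conventional random Fourier features for the Gaussian kernel on vector inputs; it is exactly this vector-input, shift-invariant setting that does not carry over to graphs.

```python
import numpy as np

def gaussian_rff(X, R=256, sigma=1.0, seed=0):
    """Random Fourier features (Rahimi and Recht, 2008): returns Z (m x R) such
    that Z @ Z.T approximates exp(-||x - y||^2 / (2 * sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, R))  # samples from the kernel's spectral density
    b = rng.uniform(0.0, 2 * np.pi, size=R)        # random phases
    return np.sqrt(2.0 / R) * np.cos(X @ W + b)
```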

A recent work, D2KE (distances to kernels and embeddings) (Wu et al., 2018b), proposes a general methodology for deriving a positive-definite kernel through an RF map from any given distance function, and it enjoys better theoretical guarantees than other distance-based methods. In (Wu et al., 2018c), D2KE was extended to design a specialized time-series embedding and showed strong empirical performance for time-series classification and clustering. We believe there is no prior work applying D2KE to the graph kernel domain. (Upon acceptance of this paper, a parallel work (Al-Rfou et al., 2019) also adopted D2KE to develop an unsupervised neural network model for learning graph-level embeddings.) Our work is the first to build effective and scalable global graph kernels using random features.

3. Geometric Embeddings of Graphs and Earth Mover’s Distance

In this section, we will introduce two important building blocks of our method, the geometric node embeddings that are used to represent a graph as a bag-of-vectors, and the well-known transportation distance EMD.

3.1. Geometric Embeddings of Graphs

The following notation will be used throughout the paper. Let a graph consisting of nodes, edges, and discrete node labels be represented as a triplet G = (V, E, ℓ), where V is the set of vertices, E is the set of undirected edges, and ℓ is a function that assigns label information to nodes from an alphabet set Σ. In this paper, we will consider both unlabeled graphs and graphs with discrete node labels. Let X = {G_i} be a set of N graphs and let Y = {y_i} be the set of graph labels corresponding to each graph in X. (Note that two types of labels are involved in this paper, i.e., node labels and graph labels; node labels characterize the properties of nodes, while graph labels are the classes that graphs belong to.) Let the geometric embeddings of a graph G be the set of vectors U = {u_1, ..., u_n} for all its n nodes, where the vector u_i ∈ R^d is the representation of node v_i and d is the size of the latent node embedding space.

Typically, with different underlying learning tasks, a graph can be characterized by different forms of matrices. Without loss of generality, we use the normalized Laplacian matrix L = D^{-1/2}(D − A)D^{-1/2} = I − D^{-1/2} A D^{-1/2}, where A is the adjacency matrix with A_{ij} = 1 if (v_i, v_j) ∈ E and A_{ij} = 0 otherwise, and D is the diagonal degree matrix. We then compute the d smallest eigenvectors of L to obtain U ∈ R^{n×d} as its geometric embeddings through a partial eigendecomposition of L. Each node v_i is then assigned the embedding vector u_i, the i-th row of the absolute eigenvector matrix |U|; since the signs of the eigenvectors are arbitrary, we use their absolute values. Letting u_i(k) be the k-th entry of the vector u_i, it satisfies 0 ≤ u_i(k) ≤ 1. Therefore, the node embedding vectors can be viewed as points in a d-dimensional unit hypercube. This fact plays an important role in our sampling strategy below.

Note that although standard dense eigensolvers require at least cubic time complexity in the number of graph nodes, with a state-of-the-art iterative eigensolver (Stathopoulos and McCombs, 2010; Wu et al., 2017b) we can efficiently solve the partial eigendecomposition with complexity that is linear in the number of graph edges. It is also worth noting that the resulting geometric node embeddings capture global properties of the graph well, since the eigenvectors associated with the low eigenvalues of L encode information about the overall structure of the graph according to spectral graph theory (Von Luxburg, 2007).

In traditional natural language processing, bag-of-words was the most common way to represent a document. With modern deep learning approaches, each element, such as a word in a document or a character in a string, is embedded into a low-dimensional vector and fed as part of a bag-of-vectors into recurrent neural networks that perform document and string classification. Similarly, we represent each graph as a bag-of-vectors using its set of geometric node embeddings. However, although a canonical ordering of the nodes of a graph can be defined, it is not reliable in most cases. Therefore, it is important to find an optimal matching between two sets of node embeddings when comparing two graphs.

3.2. Node Transportation via EMD

Now we assume that a graph G_x is represented by the bag-of-vectors U = {u_1, ..., u_n}. To use the bag-of-words model, we also need to compute a weight associated with each node vector. To be precise, if node v_i has d_i outgoing edges, we use t^x_i = d_i / Σ_j d_j as its normalized bag-of-words (nBOW) weight. Our goal is to measure the similarity between a pair of graphs using a proper distance measure. Instead of treating it as an assignment problem solved by maximum weight matching as in (Johansson and Dubhashi, 2015), we cast the task as the well-known transportation problem (Hitchcock, 1941), which can be addressed using the Earth Mover's Distance (Rubner et al., 2000).

Using EMD, one can easily measure the dissimilarity between a pair of graphs through node transportation, which essentially takes alignments between nodes into account. Let n denote the maximum number of nodes in a pair of graphs (G_x, G_y). Since t^x is the nBOW weight vector of graph G_x, it is easy to see that Σ_i t^x_i = 1; similarly, Σ_j t^y_j = 1. Then the EMD is defined as

(1) EMD(G_x, G_y) := min_{T ≥ 0} Σ_{i,j} T_{ij} C_{ij}   s.t.   Σ_j T_{ij} = t^x_i ∀i,   Σ_i T_{ij} = t^y_j ∀j,

where T ∈ R_+^{n×n} is the transportation flow matrix, with T_{ij} denoting how much of node v_i in G_x travels to node v_j in G_y, and C is the transportation cost matrix, where each entry C_{ij} is the distance between the embeddings of the i-th node of G_x and the j-th node of G_y; typically, the Euclidean distance is adopted. We note that when the ground distance is a metric in the embedding space, the EMD (1) also defines a metric between two graphs (Rubner et al., 2000). An attractive attribute of the EMD is that it provides an accurate measurement of the distance between graphs whose nodes are contextually similar but located in different positions in the graph. EMD has been observed to perform well on text categorization (Kusner et al., 2015) and graph classification (Nikolentzos et al., 2017). A straightforward way to define a kernel matrix based on EMD that measures the similarity between graphs has been shown in (Nikolentzos et al., 2017) as follows:

(2) K = −(1/2) J D_EMD J,

where J is the centering matrix and D_EMD is the matrix of EMD distances between all pairs of graphs. However, there are three problems. First, the kernel matrix in (2) is not necessarily positive-definite. Second, the EMD is expensive to compute, since its time complexity is O(n³ log n). In addition, computing the EMD for every pair of graphs incurs quadratic time complexity in the number of graphs, which is highly undesirable for large-scale graph data. In this paper, we propose a scalable global alignment graph kernel using random features to address all of these issues simultaneously.
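As a concrete reference point, the following sketch (an assumption for illustration, not the authors' C-MEX routine) solves the transportation problem in Eq. (1) directly as a linear program; dedicated EMD solvers are used in practice for speed.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def emd(Ux, tx, Uy, ty):
    """Earth Mover's Distance between two graphs given their node embeddings
    (Ux: nx x d, Uy: ny x d) and nBOW weights (tx, ty), each summing to 1."""
    C = cdist(Ux, Uy)                      # Euclidean ground distances
    nx, ny = C.shape
    # Flatten the flow matrix T row-major; build the marginal constraints.
    A_eq = np.zeros((nx + ny, nx * ny))
    for i in range(nx):                    # row sums of T equal tx
        A_eq[i, i * ny:(i + 1) * ny] = 1.0
    for j in range(ny):                    # column sums of T equal ty
        A_eq[nx + j, j::ny] = 1.0
    b_eq = np.concatenate([tx, ty])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method='highs')
    return res.fun                         # minimum transportation cost
```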

4. Scalable Global Alignment Graph Kernel Using Random Features

In this section, we first show how to construct a class of p.d. global alignment graph kernels from an optimal transportation distance (e.g., EMD), and then present a simple yet scalable way to compute expressive graph embeddings through the RF approximation. We also show that the inner product of the resulting graph embeddings uniformly converges to the exact kernel.

4.1. Global Alignment Graph Kernel Using EMD and RF

The core task is to build a positive-definite graph kernel that can make full use of both computed geometric node embeddings for graphs and a distance measure considering the alignment of the node embeddings. We here define our global graph kernel as follows:

(3) k(G_x, G_y) := ∫ p(G_ω) φ_{G_ω}(G_x) φ_{G_ω}(G_y) dG_ω,   where   φ_{G_ω}(G_x) := exp(−γ EMD(G_x, G_ω)).

Here G_ω is a random graph consisting of D random nodes with their associated node embeddings W = {w_1, ..., w_D}, where each random node embedding is sampled from a d-dimensional vector space. Thus, p(G_ω) is a distribution over the space of all random graphs of variable sizes up to D_max. Then we can derive an infinite-dimensional feature map φ_{G_ω}(G_x) from the EMD between G_x and all possible random graphs G_ω. One explanation of how our proposed kernel works is that a small random graph can implicitly partition a larger raw graph through node transportation (or node alignments) in the corresponding node embedding space using EMD, as illustrated in Fig. 1.

Figure 1. An illustration of how the EMD is used to measure the distance between a random graph and a raw graph. Each small random graph implicitly partitions the larger raw graph through node alignments in a low dimensional node embedding space.

A more formal and revealing way to interpret our kernel defined in (3) is to express it as

(4) k(G_x, G_y) = exp(−γ · softmin_{p(G_ω),γ} { EMD(G_x, G_ω) + EMD(G_ω, G_y) }),

where

(5) softmin_{p(G_ω),γ} f(G_ω) := −(1/γ) log ∫ p(G_ω) e^{−γ f(G_ω)} dG_ω

can be treated as a soft minimum function defined by the two parameters p(G_ω) and γ. Since the usual soft minimum is defined as softmin f := −log Σ e^{−f}, Equation (5) can be regarded as its smoothed version, which uses the parameter γ to control the degree of smoothness and reweights the terms by the probability density p(G_ω). Interestingly, the value of (5) is mostly determined by the minimum of f(G_ω) when f is Lipschitz-continuous and γ is large. Since EMD is a metric as discussed above, we have

EMD(G_x, G_y) ≤ EMD(G_x, G_ω) + EMD(G_ω, G_y)

by the triangle inequality. The equality holds if the maximum size of the random graph, D_max, is equal to or greater than the original graph size n. Therefore, the kernel value in (4) serves as a good approximation to the EMD between any pair of graphs G_x and G_y. Moreover, since it is defined as an inner product of feature maps, the kernel is positive-definite.
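The following small numerical sketch (an illustration under the soft-minimum form reconstructed in Eq. (5), with hypothetical sample values) shows how the Monte Carlo estimate of the soft minimum approaches the hard minimum as γ grows.

```python
import numpy as np

def softmin(f_samples, gamma):
    """Monte Carlo estimate of -(1/gamma) * log E[exp(-gamma * f(omega))],
    computed in a numerically stable log-mean-exp form."""
    a = gamma * np.asarray(f_samples)
    m = a.min()
    return (m - np.log(np.mean(np.exp(m - a)))) / gamma

rng = np.random.default_rng(0)
f = rng.random(1000)                     # f(omega_j) for 1000 sampled random graphs
for g in (1.0, 10.0, 100.0):
    print(g, softmin(f, g), f.min())     # softmin approaches min(f) as gamma grows
```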

4.2. Random Graph Embedding: Random Features of Global Alignment Graph Kernel

In this section, we introduce how to efficiently compute the proposed global alignment graph kernel and derive the random graph embedding, a graph-level representation built from the geometric node embeddings.

4.2.1. Efficient Computation of RGE


Exact computation of the proposed kernel in (3) is often infeasible, as it does not admit a simple analytic solution. A natural way to compute such a kernel is to resort to a kernel approximation that is easy to compute while uniformly converging to the exact kernel. As one of the most effective kernel approximation techniques, the random features method has demonstrated great success in approximating the Gaussian kernel (Rahimi and Recht, 2008; Le et al., 2013) and the Laplacian kernel (Wu et al., 2016) in various applications. However, as discussed in Sec. 2.2, conventional RF methods cannot be directly applied to our graph kernels, since these kernels are not shift-invariant and their inputs are not in vector form. Moreover, traditional RF methods require knowing the kernel function beforehand, which is also not available in our case. Fortunately, since we define our kernel in terms of a randomized feature map, it naturally yields the following random approximation, which does not require the aforementioned assumptions:

(6) k(G_x, G_y) ≈ ⟨Z(G_x), Z(G_y)⟩,   with   Z(G_x) := (1/√R) [ φ_{G_{ω_1}}(G_x), ..., φ_{G_{ω_R}}(G_x) ],

where Z(G_x) is an R-dimensional vector whose j-th entry is (1/√R) φ_{G_{ω_j}}(G_x), and {G_{ω_j}}, j = 1, ..., R, are i.i.d. samples drawn from p(G_ω). Note that the vector Z(G_x) can be considered the representation (embedding) of graph G_x. We call this random approximation the "random graph embedding" (RGE), a generalization of "random features" to our graph inputs. We will also show that RGE admits uniform convergence to the original kernel (3) over all pairs of graphs (G_x, G_y).

Algorithm 1 summarizes the procedure to generate feature vectors for data graphs. Several comments are in order. First, the distribution p(G_ω) is the key to generating high-quality node embeddings for random graphs. We propose two different ways to generate random graphs, which we illustrate in detail later. Second, the size D of the random graphs is typically quite small; an intuitive explanation of why a small random graph captures important global information of raw graphs was discussed in the previous section. However, since there is no prior information to determine how many random nodes are needed to segment a data graph for learning discriminative features, we sample the size of each random graph from a uniform distribution on [1, D_max] to obtain an unbiased estimate of the kernel. Finally, both the node embeddings and the distance measure can be further improved by exploiting the latest advances in these techniques.

1:Input: Data graphs {G_i}, i = 1, ..., N, node embedding size d, maximum size of random graphs D_max, graph embedding size R.
2:Output: Feature matrix Z of size N × R for the data graphs
3:Compute the nBOW weight vectors {t_i} and the normalized Laplacians of all graphs
4:Obtain node embedding vectors {U_i} by computing the d smallest eigenvectors of each normalized Laplacian
5:for j = 1, ..., R do
6:   Draw D_j uniformly from [1, D_max].
7:   Generate a random graph G_{ω_j} with D_j node embeddings W_j from Algorithm 2.
8:   Compute the j-th feature column Z(:, j) for all data graphs using EMD or another optimal transportation distance, as in Equation (3).
9:end for
10:Return feature matrix Z
Algorithm 1 Random Graph Embedding
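The sketch below (an illustrative Python rendering of Algorithm 1 with the RGE(RF) sampling scheme, reusing the node_embeddings() and emd() helpers from the earlier sketches; the uniform weights assigned to random-graph nodes are an assumption) produces the feature matrix Z whose inner products approximate the kernel in (3).

```python
import numpy as np

def nbow_weights(A):
    """Normalized bag-of-words node weights: degree / total degree."""
    deg = A.sum(axis=1)
    return deg / deg.sum()

def rge_embedding(adjs, d=6, D_max=10, R=128, gamma=1.0, seed=0):
    """adjs: list of adjacency matrices. Returns Z (N x R) such that
    Z @ Z.T approximates the global alignment graph kernel matrix."""
    rng = np.random.default_rng(seed)
    Us = [node_embeddings(A, d) for A in adjs]          # geometric embeddings
    ts = [nbow_weights(A) for A in adjs]                # nBOW weights
    lo = min(U.min() for U in Us)                       # data-independent range
    hi = max(U.max() for U in Us)
    Z = np.zeros((len(adjs), R))
    for j in range(R):
        D_j = rng.integers(1, D_max + 1)                # random graph size
        W = rng.uniform(lo, hi, size=(D_j, d))          # RGE(RF) random nodes
        w = np.full(D_j, 1.0 / D_j)                     # uniform weights (assumed)
        for i, (U, t) in enumerate(zip(Us, ts)):
            Z[i, j] = np.exp(-gamma * emd(U, t, W, w))  # feature map phi
    return Z / np.sqrt(R)
```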

By efficiently approximating the proposed global alignment graph kernel with RGE, we obtain the benefits of both improved accuracy and reduced computational complexity. Recall that the computation of EMD between two graphs of size n has O(n³ log n) time complexity, and thus existing EMD-based graph kernels require at least O(N² n³ log n) computation and O(N²) memory, where N and n are the number of graphs and the average size of the graphs, respectively. Because of the small size D of the random graphs, the computation of EMD in our RGE approximation is far cheaper (Bourgeois and Lassalle, 1971), so RGE requires only quasi-linear computation in n if we treat D as a constant (or a small number). Note that with a state-of-the-art eigensolver (Stathopoulos and McCombs, 2010; Wu et al., 2017b), we can efficiently compute the required eigenvectors with linear complexity O(mk), where m is the number of graph edges and k is the (typically quite small) number of iterations of the iterative eigensolver. Therefore, the total computational cost of RGE is quasi-linear in both the number and the size of the graphs, and its memory consumption is O(NR). Compared to other graph kernels, our method reduces computational complexity from quadratic to linear in terms of the number of graphs, and from (quasi-)cubic to (quasi-)linear in terms of the graph size. We will empirically assess the computational runtime in the subsequent experimental section.

4.2.2. Data-independent and Data-dependent Distributions


Algorithm 2 details the two sampling strategies (data-independent and data-dependent distributions) for generating the set of node embeddings of a random graph. The first scheme produces random graphs from a data-independent distribution. Traditionally, conventional RF approximation has to obtain random samples from the distribution corresponding to a user-predefined kernel (e.g., the Gaussian or Laplacian kernel). However, since we reverse the order by first defining the distribution and then defining the kernel, similar to (Wu et al., 2018b), we are free to select any distribution that captures the characteristics of the graph data well. Given that all node embeddings lie in a d-dimensional unit hypercube, we first compute the largest and smallest elements over all node embeddings and then use a uniform distribution over this range to generate the set of d-dimensional random node embedding vectors of a random graph. Since node embeddings are roughly dispersed uniformly in the d-dimensional unit hypercube, we found this scheme works well in most cases. Like traditional RF, this sampling scheme is data-independent, so we call it RGE(RF).

The other scheme is conceptually similar to recently proposed work on data-dependent traditional random features (Ionescu et al., 2017) for vector inputs, which have been shown to have lower generalization error than data-independent random features (Rudi and Rosasco, 2017). However, unlike these conventional RF methods (Ionescu et al., 2017; Rudi and Rosasco, 2017) and the conventional landmarks method that selects a representative set of whole graphs (Johansson and Dubhashi, 2015), we propose a new way to sample parts of graphs (only from training data) as random graphs, which we refer to as the Anchor Sub-Graphs (ASG) scheme, RGE(ASG). There are several potential advantages compared to landmark and RF methods. First, ASG opens the door to an effectively unlimited feature space, since there are conceptually unlimited numbers of sub-graphs, compared to the limited size (up to the number of graphs) of landmarks. Second, ASG produces a random graph by permuting the nodes of an original graph and randomly reassembling their corresponding node embeddings in the node embedding space, which may help to identify more hidden global structural information instead of only considering the raw graph topology; thanks to EMD, this hidden global structure can be captured well through node alignments. Finally, unlike RGE(RF), the ASG scheme allows exploiting node-label information in raw graphs, since this information is also accessible through the sampled nodes of the sub-graphs.

Incorporating node label information into RGE(ASG) is fairly straightforward: it is desirable to assign nodes with the same label a smaller distance than those with different labels. Therefore, we can simply set the ground distance C_{ij} = √d if nodes v_i and v_j have different node labels, since √d is the largest distance in a d-dimensional unit hypercube.
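A small sketch (an assumed helper following the rule above, not code from the paper) of folding discrete node labels into the EMD ground-distance matrix:

```python
import numpy as np
from scipy.spatial.distance import cdist

def labeled_cost(Ux, lx, Uy, ly):
    """Euclidean ground distances, overridden by sqrt(d) -- the diameter of the
    d-dimensional unit hypercube -- whenever two nodes carry different labels."""
    C = cdist(Ux, Uy)
    diff = np.asarray(lx)[:, None] != np.asarray(ly)[None, :]
    C[diff] = np.sqrt(Ux.shape[1])
    return C
```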

1:Input: Node embeddings {U_i}, i = 1, ..., N, node embedding size d, size of the random graph D_j.
2:Output: Random node embeddings W_j
3:if RGE(RF) is chosen then
4:   Compute the maximum value u_max and minimum value u_min over all node embeddings in {U_i}.
5:   Generate D_j random node embedding vectors for a random graph, each drawn uniformly from [u_min, u_max]^d.
6:else if RGE(ASG) is chosen then
7:   Uniformly draw a graph index i ∈ {1, ..., N} and select the i-th raw graph
8:   Uniformly draw D_j node indices from the i-th raw graph
9:   Generate the D_j random node embedding vectors (and their associated node labels, if available) for a random graph
10:end if
11:Return node embeddings W_j for a random graph
Algorithm 2 Random Graph Generation
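A minimal sketch of the RGE(ASG) branch of Algorithm 2 (an illustrative Python rendering with assumed variable names): sample an anchor sub-graph by drawing a training graph and a random subset of its nodes.

```python
import numpy as np

def sample_anchor_subgraph(Us, node_labels, D_j, rng):
    """Us: list of node-embedding matrices (training graphs only);
    node_labels: matching lists of node labels; D_j: random graph size."""
    i = rng.integers(len(Us))                            # pick a raw graph
    n_i = Us[i].shape[0]
    idx = rng.choice(n_i, size=min(D_j, n_i), replace=False)
    return Us[i][idx], [node_labels[i][k] for k in idx]  # embeddings + labels
```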

4.3. Convergence of Random Graph Embedding

In this section, we establish a bound on the number of random graphs R required to guarantee a good approximation between the exact kernel (3) and its random feature approximation (6); we denote their difference by Δ_R(G_x, G_y) := ⟨Z(G_x), Z(G_y)⟩ − k(G_x, G_y). We first establish a covering number for the space of graphs under the EMD metric.

Lemma 1.

There is an ε-covering E of the space of graphs under the metric defined by EMD with Euclidean ground distance, whose size |E| is finite and depends on ε, the embedding dimension d, and an upper bound on the number of nodes of any graph.

Proposition 2.

Let E be an ε-covering of the graph space as in Lemma 1, and let γ be the parameter of the feature map. If |Δ_R(G_i, G_j)| is at most a suitable fraction of ε for all pairs G_i, G_j ∈ E, then |Δ_R(G_x, G_y)| ≤ ε for all pairs of graphs (G_x, G_y).

Thus, given R random graphs, the inner product of RGE uniformly approximates the corresponding exact graph kernel within ε precision, with high probability, as shown in the following theorem.

Theorem 3.

The probability that the uniform approximation error sup over all pairs of graphs of |Δ_R(G_x, G_y)| exceeds ε decays exponentially in R. Therefore, to guarantee |Δ_R(G_x, G_y)| ≤ ε for all pairs of graphs with probability at least 1 − δ, it suffices to have R = O((1/ε²) log(|E|/δ)).


The above theorem states that, to find an ε-approximation to the exact kernel, it suffices to have a number of random features R that grows as O((1/ε²) log(|E|/δ)). We refer interested readers to the details of the proof of Theorem 3 in Appendix A.

5. Experiments

We performed experiments to demonstrate the effectiveness and efficiency of the proposed method, comparing against a total of twelve graph kernels and deep graph neural networks on nine benchmark datasets widely used for testing the performance of graph kernels (http://members.cbio.mines-paristech.fr/~nshervashidze/code/). We implemented our method in Matlab and utilized a C-MEX function (http://ai.stanford.edu/~rubner/emd/default.htm) for the computationally expensive EMD component. All computations were carried out on a DELL system with Intel Xeon processors at 2.93GHz, for a total of 16 cores, and 250 GB of memory, running the SUSE Linux operating system.

Datasets. We applied our method to widely used graph classification benchmarks from multiple domains (Shervashidze et al., 2011; Vishwanathan et al., 2010; Yanardag and Vishwanathan, 2015b): MUTAG, PTC-MR, ENZYMES, PROTEINS, NCI1, and NCI109 are graphs derived from small molecules and macromolecules, while IMDB-B, IMDB-M, and COLLAB are derived from social networks. All datasets have binary labels except ENZYMES, IMDB-M, and COLLAB, which have 6, 3, and 3 classes, respectively. All bioinformatics graph datasets have node labels, while the social network graphs have no node labels. Detailed descriptions of these 9 datasets, including statistical properties, are provided in the Appendix.

Baselines. Due to the large literature, we compare our method RGE against five representative global kernels related to our approach and three classical graph kernels: the EMD-based Indefinite Kernel (EMD) (Nikolentzos et al., 2017), Pyramid Match Kernel (PM) (Nikolentzos et al., 2017), Lovász Kernel (Lo-) (Johansson et al., 2014), Optimal Assignment Matching (OA-(A)) (Johansson and Dubhashi, 2015), Vertex Optimal Assignment Kernel (V-OA) (Kriege et al., 2016), Random Walk Kernel (RW) (Gärtner et al., 2003), Graphlet Kernel (GL) (Shervashidze et al., 2009), and Shortest Path Kernel (SP) (Borgwardt and Kriegel, 2005). Furthermore, we also compare RGE with several variants of the Weisfeiler-Leman graph kernel (WL-ST (Shervashidze et al., 2011), WL-SP (Shervashidze et al., 2011), and WL-OA-(A) (Johansson and Dubhashi, 2015)). Finally, we compare RGE against four recently developed deep learning models with node labels: Deep Graph Convolutional Neural Networks (DGCNN) (Wu et al., 2017a), PATCHY-SAN (PSCN) (Niepert et al., 2016), Diffusion CNN (DCNN) (Atwood et al., 2016), and Deep Graphlet Kernel (DGK) (Yanardag and Vishwanathan, 2015b). The first three models are built on convolutional neural networks on graphs, while the last is based on the Word2Vec model. Since the WL test is a generic technique to utilize discrete node labels for improving many stand-alone graph kernels, in this study we first focus on testing the capability of each graph kernel without node labels, and then assess the performance of each graph kernel with plain node labels and with WL techniques. (Our approach to combining RGE(ASG) with WL techniques is to first use WL to generate new node labels and then apply RGE(ASG) with these node labels; see the sketch of one WL iteration below.)
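For reference, the following compact sketch (an assumption reflecting the standard WL relabeling step, not the authors' code) shows one Weisfeiler-Leman iteration producing the refined node labels that WL-RGE(ASG) then consumes.

```python
def wl_iteration(adj_lists, labels):
    """One WL refinement step. adj_lists: list of neighbor-index lists;
    labels: current node labels. Returns compressed (integer) labels."""
    signatures = [
        (labels[v], tuple(sorted(labels[u] for u in adj_lists[v])))
        for v in range(len(labels))
    ]
    lookup = {}                               # signature -> new compressed label
    return [lookup.setdefault(sig, len(lookup)) for sig in signatures]
```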

Figure 2. Test accuracies and runtime of three variants of RGE with and without node labels when varying R (panels: ENZYMES, NCI109, IMDB-BINARY, COLLAB).
Figure 3. Runtime for computing node embeddings and RGE graph embeddings, and overall runtime, when varying the number of graphs N and the size of graph n (all other quantities held at their default values). Linear and quadratic reference curves are also plotted for easy comparison.

Setup. Since RGE is a graph embedding, we directly employ a linear SVM implemented in LIBLINEAR (Fan et al., 2008), since it faithfully separates the effectiveness of our feature representation from the power of nonlinear learning solvers. Following the convention of the graph kernel literature, we perform 10-fold cross-validation, using 9 folds for training and 1 for testing, and repeat the whole experiment ten times (thus 100 runs per dataset), reporting the average prediction accuracies and standard deviations. The ranges of the hyperparameters γ and D_max are [1e-3 1e-2 1e-1 1 10] and [3:3:30], respectively. All parameters of the SVM and hyperparameters of our method were optimized only on the training dataset. For all baselines we take the best numbers reported in their papers. Since EMD is the closest method to ours, we execute both methods under the same setting for a fair comparison and report both accuracy and computational time.
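A minimal sketch of this protocol (using scikit-learn's LinearSVC, which wraps LIBLINEAR, on the precomputed RGE features Z and graph labels y; variable names are assumptions):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Z: (N x R) RGE feature matrix, y: graph class labels
scores = cross_val_score(LinearSVC(C=1.0, max_iter=10000), Z, y, cv=10)
print(scores.mean(), scores.std())
```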

Impacts of R on Accuracy and Runtime of RGE. We conducted experiments investigating the convergence behavior and the scalability of three variants of RGE, with and without node labels, when increasing the number of random graphs. The hyperparameter γ is obtained from the previous cross-validations on the training set. We report both testing accuracy and runtime when increasing the graph embedding size R. As shown in Fig. 2, all variants of RGE converge very rapidly when increasing R from a small number (R = 4) to a relatively large number. This confirms our analysis in Theorem 3 that the RGE approximation guarantees rapid convergence to the exact kernel. The second observation is that RGE exhibits quasi-linear scalability with respect to R, as predicted by our computational analysis. This is particularly important for large-scale graph data, since most graph kernels have quadratic complexity in the number of graphs and/or in the size of the graphs.

Datasets MUTAG PTC-MR ENZYMES NCI1 NCI109
RGE(RF) 86.33±1.39 (1s) 59.82±1.42 (1s) 35.98±0.89 (38s) 74.70±0.56 (727s) 72.50±0.32 (865s)
RGE(ASG) 85.56±0.91 (2s) 59.97±1.65 (1s) 38.52±0.91 (18s) 74.30±0.45 (579s) 72.70±0.42 (572s)
EMD 84.66±2.69 (7s) 57.65±0.59 (46s) 35.45±0.93 (216s) 72.65±0.34 (8359s) 70.84±0.18 (8281s)
PM 83.83±2.86 59.41±0.68 28.17±0.37 69.73±0.11 68.37±0.14
Lo- 82.58±0.79 55.21±0.72 26.5±0.54 62.28±0.34 62.52±0.29
OA-(A) 79.89±0.98 56.77±0.85 36.12±0.81 67.99±0.28 67.14±0.26
RW 77.78±0.98 56.18±1.12 20.17±0.83 56.89±0.34 56.13±0.31
GL 66.11±1.31 57.05±0.83 18.16±0.47 47.37±0.15 48.39±0.18
SP 82.22±1.14 56.18±0.56 28.17±0.64 62.02±0.17 61.41±0.32
Table 1. Comparison of classification accuracy (± standard deviation; runtime in parentheses where measured) against graph kernel methods without node labels.
Datasets PTC-MR ENZYMES PROTEINS NCI1 NCI109
RGE(ASG) 61.5±2.34 (1s) 48.27±0.99 (28s) 75.98±0.71 (20s) 76.46±0.45 (379s) 74.42±0.30 (526s)
EMD 57.67±2.11 (42s) 42.85±0.72 (296s) 76.03±0.28 (1936s) 75.89±0.16 (7942s) 73.63±0.33 (8073s)
PM 60.38±0.86 40.33±0.34 74.39±0.45 72.91±0.53 71.97±0.15
OA-(A) 58.76±0.92 43.56±0.66 – 69.83±0.30 68.96±0.35
V-OA 56.4±1.8 35.1±1.1 73.8±0.5 65.6±0.4 65.1±0.4
RW 57.06±0.86 19.33±0.62 71.67±0.78 63.34±0.27 63.51±0.18
GL 59.41±0.94 32.70±1.20 71.63±0.33 66.00±0.07 66.59±0.08
SP 60.00±0.72 41.68±1.79 73.32±0.45 73.47±0.11 73.07±0.11
WL-RGE(ASG) 62.20±1.67 (1s) 57.97±1.16 (38s) 76.63±0.82 (30s) 85.85±0.42 (401s) 85.32±0.29 (798s)
WL-ST 57.64±0.68 52.22±0.71 72.92±0.67 82.19±0.18 82.46±0.24
WL-SP 56.76±0.78 59.05±1.05 74.49±0.74 84.55±0.36 83.53±0.30
WL-OA-(A) 59.72±1.10 53.76±0.82 – 84.75±0.21 84.23±0.19
Table 2. Comparison of classification accuracy against graph kernel methods with node labels or the WL technique ("–" indicates a value not reported).
Datasets PTC-MR PROTEINS NCI1 IMDB-B IMDB-M COLLAB
(WL-)RGE(ASG) 62.20±1.67 76.63±0.82 85.85±0.42 71.48±1.01 47.26±0.89 76.85±0.34
DGCNN 58.59±2.47 75.54±0.94 74.44±0.47 70.03±0.86 47.83±0.85 73.76±0.49
PSCN 62.30±5.70 75.00±2.51 76.34±1.68 71.00±2.29 45.23±2.84 72.60±2.15
DCNN 56.6±1.20 61.29±1.60 56.61±1.04 49.06±1.37 33.49±1.42 52.11±0.53
DGK 57.32±1.13 71.68±0.50 62.48±0.25 66.96±0.56 44.55±0.52 73.09±0.25
Table 3. Comparison of classification accuracy against recent deep learning models on graphs.

Scalability of RGE varying graphs and nodes. We further assess the scalability of RGE when varying the number of graphs and the size of the graphs on randomly generated graphs, sweeping both quantities over wide ranges. When generating random adjacency matrices, we set the number of edges to always be twice the number of nodes in a graph. We report the runtime for computing node embeddings using a state-of-the-art eigensolver (Wu et al., 2017b), for generating RGE graph embeddings, and for the overall graph classification pipeline. Fig. 3(a) shows the linear scalability of RGE when increasing the number of graphs, confirming our complexity analysis in the previous section. In addition, as shown in Fig. 3(b), RGE still exhibits linear scalability in computing eigenvectors but slightly quasi-linear scalability in RGE generation time and overall time when increasing the size of the graphs. This is because, even though RGE reduces the conventional EMD's super-cubic complexity to a cost that is quasi-linear in n (with D a small constant), the log factor starts to show its impact on computing the EMD between raw graphs and small random graphs when n becomes large (e.g., close to 1000). Interestingly, with a state-of-the-art eigensolver, the complexity of computing a few eigenvectors is linearly proportional to the graph size (Wu et al., 2017b). This is a highly desirable property of our RGE embeddings, which opens the door to large-scale applications of graph kernels in areas such as social network analysis and computational biology.

Comparison with All Baselines. Tables 1, 2, and 3 show that RGE consistently outperforms or matches the other state-of-the-art graph kernels and deep learning approaches in terms of classification accuracy. Several further observations are worth making. First, EMD, the closest method to RGE, performs well compared to most other methods but is often significantly worse than RGE, highlighting the utility of the novel kernel design using a feature map over random graphs and the effectiveness of a truly p.d. kernel. Importantly, RGE is also orders of magnitude faster than EMD in all cases, especially for data with a large graph size (like PROTEINS) or a large number of graphs (like NCI1 and NCI109).

Second, the performance of RGE makes clear the importance of considering the global properties of graphs, and of having a distance measure able to align contextually similar but positionally different nodes, for learning expressive representations of graphs. In addition, as shown in Table 2, all methods (including RGE) gain performance when considering node label information or utilizing WL iterations based on node labels. With node label information, the gaps between RGE and the other methods diminish, but RGE still shows a clear advantage.

Finally, as shown in Table 3, for the biological datasets we report WL-RGE(ASG), which obtains the best performance with WL iterations, and for the social network datasets we report RGE(ASG) without node labels, since these datasets have none. Compared to supervised deep-learning-based approaches, our unsupervised RGE method still shows clear advantages, highlighting the importance of aligning the structural roles of nodes when comparing two graphs. In contrast, most deep-learning-based methods focus on node-level representations and obtain graph-level representations through simple pooling (typically mean-pooling), which cannot take these important structural roles into account.

6. Conclusion and Future Work

In this work, we have presented a new family of p.d. and scalable global graph kernels that take into account global properties of graphs. The benefits of RGE are demonstrated by its much higher graph classification accuracy compared with other graph kernels and its (quasi-)linear scalability in terms of the number of graphs and graph size. Several interesting directions for future work are indicated: i) the graph embeddings generated by our technique can be applied and generalized to other learning problems such as graph (subgraph) matching or searching; ii) extensions of the RGE kernel for graphs with continuous node attributes and edge attributes should be explored.

References

  • R. Al-Rfou, D. Zelle, and B. Perozzi (2019) DDGK: learning graph representations for deep divergence graph kernels. arXiv:1904.09671.
  • J. Atwood, S. Pal, D. Towsley, and A. Swami (2016) Sparse diffusion-convolutional neural networks. In NIPS.
  • F. Bach (2017) On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research 18 (21), pp. 1–38.
  • K. M. Borgwardt and H. Kriegel (2005) Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on, pp. 8–pp.
  • F. Bourgeois and J. Lassalle (1971) An extension of the Munkres algorithm for the assignment problem to rectangular matrices. Communications of the ACM 14 (12), pp. 802–804.
  • P. Chen and L. Wu (2017) Revisiting spectral graph clustering with generative community models. In ICDM, pp. 51–60.
  • R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9 (Aug), pp. 1871–1874.
  • M. Fisher, M. Savva, and P. Hanrahan (2011) Characterizing structural relationships in scenes using graph kernels. ACM Transactions on Graphics (TOG) 30 (4), pp. 34.
  • T. Gärtner, P. Flach, and S. Wrobel (2003) On graph kernels: hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pp. 129–143.
  • D. Haussler (1999) Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz.
  • F. L. Hitchcock (1941) The distribution of a product from several sources to numerous localities. Studies in Applied Mathematics 20 (1-4), pp. 224–230.
  • T. Horváth, T. Gärtner, and S. Wrobel (2004) Cyclic pattern kernels for predictive graph mining. In KDD, pp. 158–167.
  • C. Ionescu, A. Popa, and C. Sminchisescu (2017) Large-scale data-dependent kernel approximation. In Artificial Intelligence and Statistics, pp. 19–27.
  • F. D. Johansson and D. Dubhashi (2015) Learning with similarity functions on graphs using matchings of geometric embeddings. In KDD, pp. 467–476.
  • F. Johansson, V. Jethava, D. Dubhashi, and C. Bhattacharyya (2014) Global graph kernels using geometric embeddings. In ICML.
  • R. Kondor and H. Pan (2016) The multiscale Laplacian graph kernel. In NIPS, pp. 2990–2998.
  • N. M. Kriege, P. Giscard, and R. Wilson (2016) On valid optimal assignment kernels and applications to graph classification. In NIPS, pp. 1623–1631.
  • M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger (2015) From word embeddings to document distances. In ICML, pp. 957–966.
  • Q. Le, T. Sarlós, and A. Smola (2013) Fastfood – approximating kernel expansions in loglinear time. In ICML, Vol. 85.
  • L. Lovász (1979) On the Shannon capacity of a graph. IEEE Transactions on Information Theory 25 (1), pp. 1–7.
  • F. Mokhtari and G. Hossein-Zadeh (2013) Decoding brain states using backward edge elimination and graph kernels in fMRI connectivity networks. Journal of Neuroscience Methods 212 (2), pp. 259–268.
  • M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In ICML, pp. 2014–2023.
  • G. Nikolentzos, P. Meladianos, and M. Vazirgiannis (2017) Matching node embeddings for graph similarity. In AAAI, pp. 2429–2435.
  • A. Rahimi and B. Recht (2008) Random features for large-scale kernel machines. In NIPS, pp. 1177–1184.
  • L. Ralaivola, S. J. Swamidass, H. Saigo, and P. Baldi (2005) Graph kernels for chemical informatics. Neural Networks 18 (8), pp. 1093–1110.
  • Y. Rubner, C. Tomasi, and L. J. Guibas (2000) The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40 (2), pp. 99–121.
  • A. Rudi and L. Rosasco (2017) Generalization properties of learning with random features. In NIPS, pp. 3218–3228.
  • N. Shervashidze and K. M. Borgwardt (2009) Fast subtree kernels on graphs. In NIPS, pp. 1660–1668.
  • N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561.
  • N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt (2009) Efficient graphlet kernels for large graph comparison. In AISTATS, pp. 488–495.
  • A. Sinha and J. C. Duchi (2016) Learning kernels with random features. In NIPS, pp. 1298–1306.
  • J. Solomon, R. Rustamov, L. Guibas, and A. Butscher (2016) Continuous-flow graph transportation distances. arXiv:1603.06927.
  • A. Stathopoulos and J. R. McCombs (2010) PRIMME: preconditioned iterative multimethod eigensolver—methods and software description. ACM Transactions on Mathematical Software (TOMS) 37 (2), pp. 21.
  • S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt (2010) Graph kernels. Journal of Machine Learning Research 11, pp. 1201–1242.
  • U. Von Luxburg (2007) A tutorial on spectral clustering. Statistics and Computing 17 (4), pp. 395–416.
  • C. Wagner, G. Wagener, R. State, and T. Engel (2009) Malware analysis with graph kernels and support vector machines. In Malicious and Unwanted Software, 2009 4th International Conference on, pp. 63–68.
  • L. Wang and H. Sahbi (2013) Directed acyclic graph kernels for action recognition. In ICCV, pp. 3168–3175.
  • B. Wu, Y. Liu, B. Lang, and L. Huang (2017a) DGCNN: disordered graph convolutional neural network based on the Gaussian mixture model. arXiv:1712.03563.
  • L. Wu, P. Chen, I. E. Yen, F. Xu, Y. Xia, and C. Aggarwal (2018a) Scalable spectral clustering using random binning features. In KDD, pp. 2506–2515.
  • L. Wu, E. Romero, and A. Stathopoulos (2017b) PRIMME_SVDS: a high-performance preconditioned SVD solver for accurate large-scale computations. SIAM Journal on Scientific Computing 39 (5), pp. S248–S271.
  • L. Wu, I. E. Yen, J. Chen, and R. Yan (2016) Revisiting random binning features: fast convergence and strong parallelizability. In KDD, pp. 1265–1274.
  • L. Wu, I. E. Yen, F. Xu, P. Ravikumar, and M. Witbrock (2018b) D2KE: from distance to kernel and embedding. arXiv:1802.04956.
  • L. Wu, I. E. Yen, J. Yi, F. Xu, Q. Lei, and M. Witbrock (2018c) Random warping series: a random features method for time-series embedding. In International Conference on Artificial Intelligence and Statistics, pp. 793–802.
  • P. Yanardag and S. Vishwanathan (2015a) A structural smoothing framework for robust graph comparison. In NIPS, pp. 2134–2142.
  • P. Yanardag and S. Vishwanathan (2015b) Deep graph kernels. In KDD, pp. 1365–1374.
  • Z. Zhang, M. Wang, Y. Xiang, Y. Huang, and A. Nehorai (2018) RetGK: graph kernels based on return probabilities of random walks. In NIPS, pp. 3968–3978.

Appendix A: Proofs of Lemma 1 and Theorem 3

A.1. Proof of Lemma 1

Proof.

Since the geometric node embeddings use the normalized eigenvectors of the Laplacian matrix, every node embedding has norm at most 1, i.e., it belongs to the unit ball. Therefore, we can find a finite ε-covering of the unit ball. Next, we define E as the collection of all possible sets of such covering points of size no larger than the maximum number of nodes of any graph, which bounds |E|. For any graph G_x, we can find a member of E with the same number of nodes whose points are each within ε of the corresponding node embeddings of G_x. Then, by the definition of EMD (1), the solution that assigns each node in G_x to its nearest covering point has overall cost less than ε, so the EMD between G_x and that member of E is at most ε. ∎

A.2. Proof of Proposition 2

Proof.

For any pair of graphs (G_x, G_y), we can find a pair (G_i, G_j) in the covering E such that the EMD from G_x to G_i and from G_y to G_j is small (10). Writing the error Δ_R(G_x, G_y) as the error at the covering pair plus two difference terms (11), we bound the second term using the smoothness of the feature map with respect to EMD (12), and similarly bound the third term (13). Combining (12), (13), and the assumption on the error at covering pairs, we obtain the desired result. ∎

A.3. Proof of Theorem 3

Proof.

Based on Proposition 2, the uniform error over all pairs of graphs is controlled by the error over pairs in the covering E (14). For any pair (G_i, G_j) ∈ E × E, since each feature φ_{G_ω}(G) lies in [0, 1] and ⟨Z(G_i), Z(G_j)⟩ is an average of R i.i.d. bounded terms with expectation k(G_i, G_j), the Hoeffding inequality gives a tail probability of at most 2 exp(−2Rt²) for a deviation of t (15). Therefore, applying a union bound over the |E|² pairs in the covering yields (16). Combining (14) and (16), and setting the resulting failure probability to δ, we obtain the desired result. ∎

Appendix B: Additional Experimental Results

Dataset MUTAG PTC ENZYMES PROTEINS NCI1 NCI109 IMDB-B IMDB-M COLLAB
Max # Nodes 28 109 126 620 111 111 136 89 492
Min # Nodes 10 2 2 4 3 4 12 7 32
Ave # Nodes 17.9 25.6 32.6 39.05 29.9 29.7 19.77 13.0 74.49
Max # Edges 33 108 149 1049 119 119 1249 1467 40119
Min # Edges 10 1 1 5 2 3 26 12 60
Ave # Edges 19.8 26.0 62.1 72.81 32.3 32.1 96.53 65.93 2457.34
# Graphs 188 344 600 1113 4110 4127 1000 1500 5000
# Graph Labels 2 2 6 2 2 2 2 3 3
# Node Labels 7 19 3 3 37 38 – – –
Table 4. Properties of the datasets ("–" indicates no node labels).

General Setup. We perform experiments to demonstrate the effectiveness and efficiency of the proposed method, comparing against a total of 12 graph kernels and deep graph neural networks on 9 benchmark datasets (shown in Table 4; http://members.cbio.mines-paristech.fr/~nshervashidze/code/), which are widely used for testing the performance of graph kernels. We implement our method in Matlab and utilize a C-MEX function (http://ai.stanford.edu/~rubner/emd/default.htm) for the computationally expensive EMD component. To accelerate the computation, we use multithreading with a total of 12 threads in all experiments. All computations were carried out on a DELL dual-socket system with Intel Xeon processors at 2.93GHz, for a total of 16 cores, and 250 GB of memory, running the SUSE Linux operating system.

B.1. Additional Results and Discussion on Accuracy and Runtime of RGE when Varying R

Setup. We conduct experiments to investigate the behavior of three variants of RGE, with and without node labels, by varying the number of random graphs. The hyperparameter γ is obtained from the previous cross-validations on the training set. Depending on the graph sizes in each dataset, we set R in a range starting from 4 and ending at a dataset-dependent maximum value. We report both testing accuracy and runtime when increasing the graph embedding size R.

B.2. Additional Results and Discussion on the Scalability of RGE when Varying the Number of Graphs and Nodes

Setup. We assess the scalability of RGE when varying the number of graphs and the size of the graphs on randomly generated graphs, sweeping both quantities over wide ranges. When generating random adjacency matrices, we set the number of edges to always be twice the number of nodes in a graph. We use the same node embedding size as in the previous sections and fix the hyperparameters of RGE itself (γ and D_max). We report the runtime for computing node embeddings using a state-of-the-art eigensolver (Stathopoulos and McCombs, 2010; Wu et al., 2017b), for generating RGE graph embeddings, and the overall runtime, respectively.

B.3. Additional Results and Discussion on Comparisons Against All Baselines

Setup. Since RGE is a graph embedding, we directly employ a linear SVM implemented in LIBLINEAR (Fan et al., 2008), since it faithfully separates the effectiveness of our feature representation from the power of nonlinear learning solvers. Following the convention of the graph kernel literature, we perform 10-fold cross-validation, using 9 folds for training and 1 for testing, and repeat the whole experiment ten times (thus 100 runs per dataset), reporting the average prediction accuracies and standard deviations. The ranges of the hyperparameters γ and D_max are [1e-3 1e-2 1e-1 1 10] and [3:3:30], respectively. All parameters of the SVM and hyperparameters of our method were optimized only on the training dataset. The node embedding size d is set to 4, 6, or 8, and is always the same for all variants of RGE on the same dataset. For all baselines we take the best numbers reported in their papers, except EMD, for which we rerun the experiments for fair comparisons in terms of both accuracy and runtime. Since RGE, OA, EMD, and PM are essentially built on the same node embeddings from the adjacency matrices, we take the numbers for OA-(A) from (Johansson and Dubhashi, 2015) for a fair comparison.