In several domains like chemo- and bioinformatics as well as social network and image analysis, structured objects appear naturally. Graph kernels are a key concept for the application of kernel methods to structured data, and various approaches have been developed in recent years, see [Vis+2010, Neu+2015] and references therein. The considered graphs can be divided into (i) graphs with discrete labels, e.g., molecular graphs, where nodes are annotated by the symbols of the atoms they represent, and (ii) attributed graphs with (multi-dimensional) real-valued labels in addition to discrete labels. Attributed graphs often appear in domains like bioinformatics [Borgwardt2005a] or image classification [Harchaoui2007], where attributes may represent physical properties of protein secondary structure elements or RGB values of colors, respectively. Taking the continuous information into account has been shown empirically to be beneficial in several applications, e.g., see [Neu+2015, Borgwardt2005a, Fer+2013, Harchaoui2007, Kri+2012, Ors+2015].
Kernels are equivalent to the inner product in an associated feature space, where a feature map assigns the objects of the input space to a feature vector. The various graph kernels proposed in recent years can be divided into approaches that either compute feature maps (i) explicitly, or (ii) implicitly [Kri+2014]. Explicit computation schemes have been shown to be scalable and allow the use of fast linear support vector classifiers, e.g., [Joa+2006], while implicit computation schemes are often slow. Alternatively, we may divide graph kernels according to their ability to handle annotations of nodes and edges. The proposed graph kernels are either (i) restricted to discrete labels, or (ii) compare annotations like continuous values by user-specified kernels. Typically, kernels of the first category implicitly compare annotations of nodes and edges by the Dirac kernel, which requires values to match exactly and is not adequate for continuous values. The two classifications of graph kernels mentioned above largely coincide: graph kernels supporting complex annotations use implicit computation schemes and do not scale well, whereas graphs with discrete labels can be compared efficiently by graph kernels based on explicit feature maps. We exploit this observation to develop a unifying treatment. But first, let us touch upon related work.
I-A Previous Work
In recent years, various graph kernels have been proposed. In [Gaertner2003] and [Kashima2003] graph kernels were proposed based on random walks, which count the number of walks two graphs have in common. Since then, random walk kernels have been studied intensively, e.g., [Kang2012, Kri+2014, Mahe2004, Vis+2010]. Kernels based on tree patterns were initially proposed in [Ramon2003]. These two approaches were originally applied to graphs with discrete labels, but the method of implicit computation supports comparing attributes by user-specified kernel functions. Kernels based on shortest paths [Bor+2005] are computed by performing $1$-step walks on the transformed input graphs, where edges are annotated with shortest-path lengths. A drawback of the approaches mentioned above is their high computational cost.
A different line in the development of graph kernels focused particularly on scalable graph kernels. These kernels are typically computed efficiently by explicit feature maps, but are severely limited to graphs with discrete labels. Prominent examples are kernels based on subgraphs up to a fixed size, e.g., [She+2009], or specific subgraphs like cycles and trees [Horvath2004]. Other approaches of this category encode the neighborhood of every node by different techniques [Hido2009, Neu+2015, She+2011].
Recently, several kernels specifically designed for graphs with continuous attributes were proposed [Fer+2013, Kri+2012, Ors+2015], and their experimental evaluation confirms the importance of handling continuous attributes adequately.
Several articles on scalable kernels for graphs with discrete labels propose the adaptation of their approach to graphs with continuous attributes as future work, e.g., see [She+2009, She+2011]. Yet, only little work in this direction has been reported, most likely because this is in general a non-trivial task. An immediate approach is to discretize continuous values by binning. A key problem of this method is that two values that differ only marginally may still fall into different bins and are then considered non-matching. Still, promising experimental results of such approaches have been reported for certain data sets, e.g., [Neu+2015].
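The binning problem can be made concrete with a minimal sketch. The helper `bin_label` and the bin width of 1.0 are illustrative assumptions and are not taken from any of the cited approaches.

```python
import math

def bin_label(value, width=1.0):
    """Map a real value to the index of the fixed-width bin that contains it."""
    return math.floor(value / width)

# Two values that differ by only 0.02 end up in different bins ...
close_pair = (bin_label(0.99), bin_label(1.01))
# ... while two values that differ by 0.98 share a bin.
far_pair = (bin_label(0.01), bin_label(0.99))
print(close_pair, far_pair)
```

Under a Dirac comparison of the resulting labels, the nearly identical values 0.99 and 1.01 count as non-matching, whereas 0.01 and 0.99 count as a perfect match.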
I-B Our Contribution
We introduce hash graph kernels for graphs with continuous attributes. This family of kernels is obtained by a generic method, which iteratively hashes continuous attributes to discrete labels in order to apply a base kernel for graphs with discrete labels. This allows us to construct a single combined feature vector for a graph from the individual feature vectors of each iteration. The essence of this approach is:
The hash graph kernel framework lifts every graph kernel that supports discrete labels to a kernel which can handle continuous attributes.
We exemplify this for two established graph kernels:
We obtain a variation of the Weisfeiler-Lehman subtree kernel, which implicitly employs a non-trivial kernel on the node and edge annotations and is suitable for continuous values.
Moreover, we derive a variant of the shortest-path kernel which also supports continuous attributes while being efficiently computable by explicit feature maps.
For both kernels we provide a detailed theoretical analysis. Moreover, the effectiveness of these kernels is demonstrated in an extensive experimental study on real-world and synthetic data sets. The results show that hash graph kernels are orders of magnitude faster than state-of-the-art kernels for attributed graphs without a drop in classification accuracy.
An (undirected) graph $G$ is a pair $(V,E)$ with a finite set of nodes $V$ and a set of edges $E \subseteq \{\{u,v\} \subseteq V \mid u \neq v\}$. We denote the set of nodes and the set of edges of $G$ by $V(G)$ and $E(G)$, respectively. For ease of notation we denote the edge $\{u,v\}$ in $E(G)$ by $(u,v)$ or $(v,u)$. Moreover, $N(v)$ denotes the neighborhood of $v$ in $V(G)$, i.e., $N(v) = \{u \in V(G) \mid (v,u) \in E(G)\}$. An attributed graph is a graph $G$ endowed with an attribute function $a\colon V(G) \to \mathbb{R}^d$ for $d \geq 1$. We say that $a(v)$ is an attribute of $v$ for $v$ in $V(G)$. A labeled graph is an attributed graph with an attribute function $l$, where the codomain of $l$ is restricted to a (finite) alphabet, e.g., a finite subset of the natural numbers. Analogously, we say that $l(v)$ is a label of $v$ in $V(G)$.
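Let $\chi$ be a non-empty set and let $k\colon \chi \times \chi \to \mathbb{R}$ be a function. Then $k$ is a kernel on $\chi$ if there is a real Hilbert space $\mathcal{H}_k$ and a mapping $\phi\colon \chi \to \mathcal{H}_k$ such that $k(x,y) = \langle \phi(x), \phi(y) \rangle$ for $x$ and $y$ in $\chi$, where $\langle \cdot, \cdot \rangle$ denotes the inner product of $\mathcal{H}_k$. We call $\phi$ a feature map, and $\mathcal{H}_k$ a feature space. Let $\mathcal{G}$ be a non-empty set of (attributed) graphs, then a kernel $k\colon \mathcal{G} \times \mathcal{G} \to \mathbb{R}$ is called a graph kernel. We denote by $k_\delta$ the Dirac kernel with $k_\delta(x,y) = 1$ if $x = y$, and $0$ otherwise.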
III Hash Graph Kernels
In this section we introduce hash graph kernels. The main idea of hash graph kernels is to map attributes to labels using a family of hash functions and then apply a kernel for graphs with discrete labels.
Let $\mathcal{H}$ be a family of hash functions and $G$ a graph with attribute function $a\colon V(G) \to \mathbb{R}^d$. We can transform $(G,a)$ to a graph with discrete labels by mapping each attribute $a(v)$ to $h(a(v))$ with some function $h$ in $\mathcal{H}$. For short, we write $h(G)$ for the labeled graph obtained by this procedure. The function $h$ is drawn at random from the family of hash functions $\mathcal{H}$. This procedure is repeated multiple times in order to lower the variance. Thus, we obtain a sequence of discretely labeled graphs $(h_i(G))_{i=1}^{I}$, where $I$ is the number of iterations. Hash graph kernels compare these sequences of labeled graphs by an arbitrary graph kernel for labeled graphs, which we refer to as discrete base graph kernel, e.g., the Weisfeiler-Lehman subtree or the shortest-path kernel.
Definition 1 (Hash graph kernel).
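Let $\mathcal{H}$ be a family of hash functions and $k_b$ a discrete base graph kernel, then the hash graph kernel for two attributed graphs $G$ and $H$ is defined as
$$k_{\mathrm{HGK}}(G,H) = \frac{1}{I} \sum_{i=1}^{I} k_b\big(h_i(G), h_i(H)\big),$$
where $h_i$ is obtained by choosing hash functions from $\mathcal{H}$, and $h_i(G)$ denotes the labeled graph obtained from $G$ in iteration $i$.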
We will discuss hash functions, possible ways to choose them from $\mathcal{H}$, and how they relate to the global kernel value in Section III-B and Section IV. We proceed with the algorithmic aspects of hash graph kernels. It is desirable for efficiency to compute explicit feature maps for graph kernels. We can obtain feature vectors for hash graph kernels under the assumption that the discrete base graph kernel can be computed by explicit feature maps. This can be achieved by concatenating the feature vectors for each iteration and normalizing the combined feature maps by $\sqrt{1/I}$, where $I$ is the number of iterations, according to the pseudocode in Algorithm 1.
Since hash graph kernels are a normalized sum over discrete base graph kernels applied to a sequence of transformed input graphs, it is clear that we again obtain a valid kernel.
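The concatenation scheme described above can be sketched as follows. This is a minimal illustration and not a transcription of Algorithm 1: the adjacency-dictionary graph representation, the toy base feature map `label_histogram`, and the two fixed binning functions standing in for randomly drawn hash functions are all assumptions made for the example.

```python
import math
from collections import Counter

def label_histogram(adj, labels):
    """Toy discrete base feature map (assumption): count the node labels."""
    return Counter(labels[v] for v in adj)

def hgk_feature_map(adj, attrs, hash_functions, base_feature_map):
    """Concatenate the base feature maps of all iterations, scaled by sqrt(1/I)."""
    scale = math.sqrt(1.0 / len(hash_functions))
    phi = {}
    for i, h in enumerate(hash_functions):
        # Map every continuous attribute to a discrete label.
        labels = {v: h(attrs[v]) for v in adj}
        # Keys are prefixed with the iteration index, which realizes the concatenation.
        for key, count in base_feature_map(adj, labels).items():
            phi[(i, key)] = scale * count
    return phi

def dot(phi_g, phi_h):
    """Inner product of two sparse feature maps."""
    return sum(val * phi_h.get(key, 0.0) for key, val in phi_g.items())

# Two fixed "hash functions" on scalar attributes: unit-width bins shifted by 0 and 0.5.
hashes = [lambda x: math.floor(x), lambda x: math.floor(x + 0.5)]
adj_g, attrs_g = {0: [1], 1: [0]}, {0: 0.2, 1: 0.7}
adj_h, attrs_h = {0: [1], 1: [0]}, {0: 0.4, 1: 1.2}
k_value = dot(hgk_feature_map(adj_g, attrs_g, hashes, label_histogram),
              hgk_feature_map(adj_h, attrs_h, hashes, label_histogram))
print(k_value)
```

Both graphs must be processed with the same list of hash functions, so that block $i$ of one feature vector is compared with block $i$ of the other. The inner product then equals the average of the base kernel values over the iterations, here $(2 + 2)/2 = 2$ up to floating-point rounding.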
For the explicit computation of feature maps by Algorithm 1 we get the following bound on the running time.
Proposition 1 (Running Time).
Algorithm 1 computes the hash graph kernel feature map for a graph $G$ in time $\mathcal{O}(I \cdot (T_{\mathcal{H}}(G) + T_{\phi}(G)))$, where $T_{\mathcal{H}}(G)$ denotes the running time to evaluate the hash functions for $G$ and $T_{\phi}(G)$ the running time to compute the graph feature map of the discrete base graph kernel for $G$.
This follows directly from Algorithm 1. ∎
Notice that when we fix the number of iterations and assume $T_{\mathcal{H}}(G) \leq T_{\phi}(G)$, the hash graph kernel can be computed in the same asymptotic running time as the discrete base graph kernel. Moreover, notice that lines 4 to 5 in Algorithm 1 can be easily executed in parallel.
III-B Hash Functions
In this section we discuss possible realizations of the hashing technique used to obtain hash graph kernels according to Definition 1.
The key idea is to choose a family of hash functions $\mathcal{H}$ and draw hash functions $h_1$ and $h_2$ in each iteration such that $\Pr[h_1(x) = h_2(y)]$ is an adequate measure of similarity between attributes $x$ and $y$ in $\mathbb{R}^d$. For the case that $h_1 = h_2$ is drawn at random, such families of hash functions have been proposed, e.g., see [Rah+2008, Andoni2008, Dat+2004]. Unfortunately, these results do not lift to kernels composed of products of base kernels. Thus they do not directly transfer to hash graph kernels, where complex discrete base graph kernels are employed. For example, let $k_\Delta$ be the hat kernel on $\mathbb{R}$ and $h$ a hash function, such that $\Pr[h(x) = h(y)] = k_\Delta(x,y)$, see [Rah+2008]. However, in general
$$\Pr[h(x) = h(y) \wedge h(x') = h(y')] \neq k_\Delta(x,y) \cdot k_\Delta(x',y').$$
To overcome this issue, we introduce the following concept.
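Let $k\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a kernel and let $\mathcal{H} = \{h\colon \mathbb{R}^d \to S\}$ for some set $S$ be a family of hash functions. Then $\mathcal{H}$ is an independent $k$-hash family if $\Pr[h_1(x) = h_2(y)] = k(x,y)$, where $h_1$ and $h_2$ are chosen independently and uniformly at random from $\mathcal{H}$.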
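The failure of a single shared hash function on products of kernels, which motivates the definition above, can be checked numerically. The sketch below assumes one concrete construction for the hat-kernel example: bins of fixed width `delta` with a shift `u` drawn uniformly from $[0, \delta)$, so that $h_u(x) = \lfloor (x+u)/\delta \rfloor$. The function names are hypothetical, and the shift is swept over a regular grid instead of being sampled, which keeps the result deterministic.

```python
import math

def hat_kernel(x, y, delta=1.0):
    """Hat kernel: max(0, 1 - |x - y| / delta)."""
    return max(0.0, 1.0 - abs(x - y) / delta)

def collision_prob(pairs, delta=1.0, steps=1000):
    """Fraction of shifts u for which h_u(x) == h_u(y) holds for ALL given pairs,
    where h_u(x) = floor((x + u) / delta) and the SAME shift is used for every pair."""
    hits = 0
    for i in range(steps):
        u = (i + 0.5) * delta / steps
        if all(math.floor((x + u) / delta) == math.floor((y + u) / delta)
               for x, y in pairs):
            hits += 1
    return hits / steps

# A single pair: the collision probability equals the hat kernel value.
p_single = collision_prob([(0.0, 0.5)])
# Two pairs hashed with one shared function: the events are not independent.
p_joint = collision_prob([(0.0, 0.5), (0.5, 1.0)])
product = hat_kernel(0.0, 0.5) * hat_kernel(0.5, 1.0)
print(p_single, p_joint, product)
```

In this example the pair $(0, 0.5)$ collides exactly when $u < 0.5$ and the pair $(0.5, 1)$ exactly when $u \geq 0.5$, so the joint collision probability is $0$ although the product of the two hat kernel values is $0.25$.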
IV Hash Graph Kernel Instances
In the following we prove that hash graph kernels approximate implicit variants of the shortest-path and the Weisfeiler-Lehman subtree kernel for attributed graphs.
IV-A Shortest-path kernel
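We first describe the implicit shortest-path kernel which can handle attributes. Let $G$ be an attributed graph and let $d_{uv}$ denote the length of the shortest path between $u$ and $v$ in $V(G)$. The kernel is then defined as
$$k_{\mathrm{SP}}(G,H) = \sum_{\substack{(u,v) \in V(G)^2 \\ u \neq v}} \; \sum_{\substack{(w,z) \in V(H)^2 \\ w \neq z}} k((u,v),(w,z)), \quad \text{where} \quad k((u,v),(w,z)) = k_L(a(u),a(w)) \cdot k_L(a(v),a(z)) \cdot k_D(d_{uv},d_{wz}). \tag{1}$$
Here $k_L$ is a kernel for comparing node labels or attributes and $k_D$ is a kernel to compare shortest-path distances, such that $k_D(d_{uv},d_{wz}) = 0$ if $d_{uv} = \infty$ or $d_{wz} = \infty$.
If we set $k_L$ and $k_D$ to the Dirac kernel, we can compute an explicit mapping $\phi$ for the kernel $k_{\mathrm{SP}}$: Assume a labeled graph $G$, then each component of $\phi(G)$ counts the number of occurrences of a triple of the form $(l(u), l(v), d_{uv})$ for $(u,v)$ in $V(G)^2$, $u \neq v$, and $d_{uv} < \infty$. It is easy to see that
$$\phi(G)^\top \phi(H) = k_{\mathrm{SP}}(G,H).$$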
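The explicit mapping for the Dirac case can be sketched in a few lines. The sketch assumes unweighted graphs given as adjacency dictionaries, so that shortest-path lengths can be obtained by breadth-first search; the function names are illustrative and not taken from the authors' implementation.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Shortest-path lengths from source to all reachable nodes (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def sp_feature_map(adj, labels):
    """Count triples (l(u), l(v), d_uv) over ordered pairs u != v with finite distance."""
    phi = Counter()
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                phi[(labels[u], labels[v], d)] += 1
    return phi

def sp_kernel(phi_g, phi_h):
    """Inner product of two triple-count feature maps."""
    return sum(count * phi_h[triple] for triple, count in phi_g.items())

# Path a-b-c with labels A, B, A, and a single edge with labels A, B.
path = sp_feature_map({0: [1], 1: [0, 2], 2: [1]}, {0: "A", 1: "B", 2: "A"})
edge = sp_feature_map({0: [1], 1: [0]}, {0: "A", 1: "B"})
print(sp_kernel(path, edge), sp_kernel(path, path))
```

Unreachable pairs never appear in the breadth-first search result, which matches the convention that infinite distances contribute zero. For the two example graphs the kernel value is 4, and the path compared with itself yields 12.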
The following theorem shows that the hash graph kernel approximates $k_{\mathrm{SP}}$ arbitrarily closely by using the explicit shortest-path kernel as a discrete base kernel and an independent $k_L$-hash family.
Theorem 1 (Approximation of implicit shortest-path kernel for continuous attributes).
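Let $k$ be a kernel and let $\mathcal{H}$ be an independent $k$-hash family. Assume that in each iteration of Algorithm 1 each attribute is mapped to a label using a hash function chosen independently and uniformly at random from $\mathcal{H}$. Then Algorithm 1 with the explicit shortest-path kernel acting as the discrete base kernel approximates $k_{\mathrm{SP}}$, with $k_L = k$ and $k_D = k_\delta$, such that
$$\Pr\big[\,|k_{\mathrm{HGK}}(G,H) - k_{\mathrm{SP}}(G,H)| > \lambda\,\big] \leq 2e^{-2\lambda^2 I}.$$
Moreover with any constant probability,
$$\sup_{G,H \in \mathcal{G}} |k_{\mathrm{HGK}}(G,H) - k_{\mathrm{SP}}(G,H)| \leq \lambda$$
for $\lambda > 0$, a finite set of graphs $\mathcal{G}$, and a sufficiently large number of iterations $I$.
By assumption, we have $\Pr[h_1(x) = h_2(y)] = k(x,y)$ for $h_1$ and $h_2$ chosen independently and uniformly at random from $\mathcal{H}$. Since we are using a Dirac kernel to compare discrete attributes, we get $\mathbb{E}[k_\delta(h_1(x), h_2(y))] = \Pr[h_1(x) = h_2(y)] = k(x,y)$. Since $\mathcal{H}$ is an independent $k$-hash family, $\mathbb{E}[k_\delta(h_1(a(u)), h_2(a(w))) \cdot k_\delta(h_3(a(v)), h_4(a(z)))] = k(a(u),a(w)) \cdot k(a(v),a(z))$ for $h_1$, $h_2$, $h_3$, and $h_4$ chosen independently and uniformly at random from $\mathcal{H}$. Hence, by the above, the expected contribution of the node pairs $(u,v)$ and $(w,z)$ in a single iteration is $k((u,v),(w,z))$. By Eq. 1 and using the linearity of expectation, $\mathbb{E}[k_{\mathrm{HGK}}(G,H)] = k_{\mathrm{SP}}(G,H)$. Now assume that $k_{\mathrm{SP}}$ is normalized to $[0,1]$, then the first claim follows from the Hoeffding bound [Hoe1963]. In order to derive the second claim, we choose $I = c \log(|\mathcal{G}|)/\lambda^2$ with a large enough constant $c$, where $I$ is the number of iterations in Algorithm 1. From the first claim, we get
$$\Pr\big[\,|k_{\mathrm{HGK}}(G,H) - k_{\mathrm{SP}}(G,H)| > \lambda\,\big] \leq 2e^{-2c\log|\mathcal{G}|} = 2|\mathcal{G}|^{-2c}.$$
The claim then follows from the Union bound over all pairs of graphs in $\mathcal{G}$. ∎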
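To get a feeling for what the Hoeffding bound implies, recall its standard form for the average of $I$ independent random variables with values in $[0,1]$: the probability that the average deviates from its expectation by at least $\lambda$ is at most $2e^{-2I\lambda^2}$. The sketch below solves this for the number of iterations; the helper name and the example values $\lambda = 0.1$ and failure probability $0.05$ are assumptions chosen for illustration.

```python
import math

def iterations_for(lam, failure_prob):
    """Smallest I such that 2 * exp(-2 * I * lam**2) <= failure_prob."""
    return math.ceil(math.log(2.0 / failure_prob) / (2.0 * lam ** 2))

def hoeffding_bound(lam, num_iterations):
    """Upper bound on the probability of a deviation of at least lam after I iterations."""
    return 2.0 * math.exp(-2.0 * num_iterations * lam ** 2)

needed = iterations_for(0.1, 0.05)
print(needed, hoeffding_bound(0.1, needed))
```

This is a worst-case guarantee for a single pair of graphs on a kernel normalized to $[0,1]$, so it should be read as an upper limit on the required effort rather than as a recommended setting.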
Notice that we can also approximate by employing a -independent hash family.
IV-B Weisfeiler-Lehman subtree kernel
By the same arguments, we can derive a similar result for the Weisfeiler-Lehman subtree kernel. The following proposition derives an implicit version of the Weisfeiler-Lehman subtree kernel.
Proposition 2 (Implicit Weisfeiler-Lehman subtree kernel).
where is the set of bijections between and such that for . Then is equivalent to the Weisfeiler-Lehman subtree kernel.
Follows from [She+2011, Theorem 8]. ∎
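We show that Algorithm 1 with the (explicit) Weisfeiler-Lehman subtree kernel acting as the discrete base graph kernel probabilistically approximates the graph kernel $\hat{k}_{\mathrm{WL}}$, where $\hat{k}_{\mathrm{WL}}$ is defined by substituting the Dirac kernel $k_\delta$ in the definition of the implicit Weisfeiler-Lehman subtree kernel $k_{\mathrm{WL}}$ by a kernel $k$.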
Corollary 1 (Approximation of implicit Weisfeiler-Lehman subtree kernel for continuous attributes).
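Let $k$ be a kernel and let $\mathcal{H}$ be an independent $k$-hash family. Assume that in each iteration of Algorithm 1 each attribute is mapped to a label using a hash function chosen independently and uniformly at random from $\mathcal{H}$. Then Algorithm 1 with the Weisfeiler-Lehman subtree kernel with $t$ iterations acting as the discrete base kernel approximates $\hat{k}_{\mathrm{WL}}$ such that
$$\Pr\big[\,|k_{\mathrm{HGK}}(G,H) - \hat{k}_{\mathrm{WL}}(G,H)| > \lambda\,\big] \leq 2e^{-2\lambda^2 I}.$$
Moreover with any constant probability,
$$\sup_{G,H \in \mathcal{G}} |k_{\mathrm{HGK}}(G,H) - \hat{k}_{\mathrm{WL}}(G,H)| \leq \lambda$$
for $\lambda > 0$.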
Since $k_{\mathrm{WL}}$ is written as a sum of products and sums of Dirac kernels, we can again use the property of the independent $k$-hash family and argue analogously to the proof of Theorem 1. ∎
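For reference, the explicit Weisfeiler-Lehman subtree feature map that serves as the discrete base kernel can be sketched as follows. This is a simplified illustration and not the authors' implementation: refined labels are kept as nested tuples instead of being compressed to integers, which avoids a label dictionary shared between graphs but lets label sizes grow with the number of iterations. Initial labels are assumed to be mutually comparable, e.g., all integers.

```python
from collections import Counter

def wl_feature_map(adj, labels, num_iterations):
    """Count the node labels observed in refinement rounds 0..num_iterations."""
    current = {v: labels[v] for v in adj}
    phi = Counter((0, lab) for lab in current.values())
    for it in range(1, num_iterations + 1):
        # New label: own label together with the sorted multiset of neighbor labels.
        current = {v: (current[v], tuple(sorted(current[w] for w in adj[v])))
                   for v in adj}
        phi.update((it, lab) for lab in current.values())
    return phi

def wl_kernel(phi_g, phi_h):
    """Inner product of two label-count feature maps."""
    return sum(count * phi_h[key] for key, count in phi_g.items())

# A triangle and a path on three nodes, all nodes carrying the label 0.
triangle = wl_feature_map({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: 0, 1: 0, 2: 0}, 1)
path = wl_feature_map({0: [1], 1: [0, 2], 2: [1]}, {0: 0, 1: 0, 2: 0}, 1)
print(wl_kernel(triangle, path), wl_kernel(triangle, triangle))
```

After one refinement round only the middle node of the path has the same label as the triangle nodes, which gives a kernel value of $3 \cdot 3 + 3 \cdot 1 = 12$. In a hash graph kernel the initial labels would be the hashed attributes of the respective iteration.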
V Experimental Evaluation
Our intention here is to investigate the benefits of hash graph kernels compared to the state-of-the-art. More precisely, we address the following questions:
Q1: How do hash graph kernels compare to state-of-the-art graph kernels for attributed graphs in terms of classification accuracy and running time?
Q2: How does the choice of the discrete base kernel influence the classification accuracy?
Q3: Does the number of iterations influence the classification accuracy of hash graph kernels in practice?
V-A Data Sets and Graph Kernels
We used the following data sets to evaluate and compare hash graph kernels: Enzymes [Borgwardt2005a, Fer+2013], Frankenstein [Ors+2015], Proteins [Borgwardt2005a, Fer+2013], SyntheticNew [Fer+2013], and Synthie. (Due to space constraints we refer to https://ls11-www.cs.uni-dortmund.de/staff/morris/graphkerneldatasets for descriptions, references, and statistics.) The data set Synthie consists of 400 graphs, subdivided into four classes, with 15 real-valued node attributes. It was generated as follows: First, we generated two Erdős-Rényi graphs $G_1$ and $G_2$ on vertices with edge probability . From each $G_i$ we generated a seed set $S_i$ for $i$ in $\{1,2\}$ of 200 graphs by randomly adding or deleting 25% of the edges of $G_i$. Connected graphs were obtained by randomly sampling 10 seeds and randomly adding edges between them. We generated the class $C_1$ of 200 graphs, choosing seeds with probability 0.8 from $S_1$ and 0.2 from $S_2$, and the class $C_2$ with interchanged probabilities. Finally, we generated a set of real-valued vectors of dimension 15, subdivided into classes $A$ and $B$, following the approach of [Guy+2003]. We then subdivided $C_i$ into two classes $C_i^A$ and $C_i^B$ by drawing a random attribute for each node. For the class $C_i^A$ ($C_i^B$), a node stemming from a seed in $S_1$ ($S_2$) was annotated by an attribute from $A$, and from $B$ otherwise.
We implemented hash graph kernels with the explicit shortest-path graph kernel (HGK-SP) and the Weisfeiler-Lehman subtree kernel (HGK-WL) acting as discrete base kernels in Python. (The source code can be obtained from https://ls11-www.cs.uni-dortmund.de/people/morris/hashgraphkernel.zip.) We compare our kernels to the GraphHopper kernel (GH) [Fer+2013], an instance of the graph invariant kernels (GI) [Ors+2015], and the propagation kernel from [Neu+2015] which supports continuous attributes (P2K). Additionally, we compare our kernels to the Weisfeiler-Lehman subtree kernel (WL) and the explicit shortest-path kernel (SP), which only take discrete label information into account, to exemplify the usefulness of using continuous attributes. Since the Frankenstein, SyntheticNew, and Synthie data sets do not have discrete labels, we used node degrees as labels instead. For GI we used the original Python implementation provided by the author of [Ors+2015]. The variants of the hash graph kernel are computed on a single core only. For GH and P2K we used the original Matlab implementation provided by the authors of [Fer+2013] and [Neu+2015], respectively.
V-B Experimental Protocol
For each kernel, we computed the normalized gram matrix. For explicit kernels we computed the gram matrix via the linear kernel. Note that the running times of the hash graph kernels could be further reduced by employing linear kernel methods.
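The Gram matrix computation for explicit kernels can be sketched as below. The source does not state which normalization was used; the sketch assumes the common cosine normalization $K_{ij} / \sqrt{K_{ii} K_{jj}}$, and it represents feature vectors as sparse dictionaries, which is also an assumption.

```python
import math

def gram_matrix(feature_maps):
    """Gram matrix of the linear kernel on sparse feature maps (dicts)."""
    def dot(p, q):
        return sum(val * q.get(key, 0.0) for key, val in p.items())
    return [[dot(p, q) for q in feature_maps] for p in feature_maps]

def normalize(gram):
    """Cosine normalization (assumed): K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(gram)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(gram[i][i] * gram[j][j])
            # Graphs with an all-zero feature vector get similarity 0.
            result[i][j] = gram[i][j] / denom if denom > 0 else 0.0
    return result

features = [{"a": 3.0, "b": 4.0}, {"a": 1.0}]
K = normalize(gram_matrix(features))
print(K)
```

With this normalization every graph has unit self-similarity, so graphs with many nodes do not dominate the kernel values merely because their feature counts are larger.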
We computed the classification accuracies using the C-SVM implementation of LIBSVM [Cha+11], using 10-fold cross validation. The $C$-parameter was selected from
by 10-fold cross validation on the training folds. We repeated each 10-fold cross validation ten times with different random folds and report average accuracies and standard deviations. Since the hash graph kernels and P2K are randomized algorithms, we computed each gram matrix ten times and report average classification accuracies and running times. We report running times for WL, HGK-WL, and P2K with refinement steps.
We fixed the number of iterations of the hash graph kernels for all but the Synthie data set to 20, since this was sufficient to obtain state-of-the-art classification accuracies. For the Synthie data set we set the number of iterations to 100, which indicates that this data set is harder to classify. The number of refinement/propagation steps for WL, HGK-WL, and P2K was selected from using 10-fold cross validation on the training folds only. For the hash graph kernels we centered the attributes dimensionwise to the mean, scaled to unit variance, and used -stable LSH [Dat+2004] as hash functions to hash attributes to discrete labels. For the Enzymes, the Proteins, SyntheticNew, and Synthie data set we set the interval length to , see [Dat+2004]. Due to the high dimensional sparse attributes of the Frankenstein data set we set the interval length to . For each hash graph kernel we report classification accuracy and running time with and without taking discrete labels into account. For the HGK-WL we propagated label and hashed attribute information separately. In order to speed up the computation we used the same LSH hash function for all attributes in one iteration.
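The preprocessing and hashing step can be sketched as follows. The sketch assumes the Gaussian (2-stable) variant of the scheme in [Dat+2004], i.e., $h(x) = \lfloor (a \cdot x + b)/w \rfloor$ with $a$ drawn from a standard normal distribution and $b$ uniform on $[0, w)$. The interval length `width=1.0`, the toy attribute vectors, and all function names are placeholders and not the settings used in the experiments.

```python
import math
import random

def standardize(vectors):
    """Center every dimension to mean 0 and scale it to unit variance."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    # A constant dimension has standard deviation 0; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v[j] - means[j]) ** 2 for v in vectors) / n) or 1.0
            for j in range(dim)]
    return [[(v[j] - means[j]) / stds[j] for j in range(dim)] for v in vectors]

def draw_lsh(dim, width, rng):
    """Draw h(x) = floor((a . x + b) / width) with a ~ N(0, 1)^dim and b ~ U[0, width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)

rng = random.Random(0)
attrs = standardize([[1.0, 10.0], [1.1, 10.5], [5.0, 30.0]])
# One hash function per iteration, shared by all attributes of that iteration.
hashes = [draw_lsh(dim=2, width=1.0, rng=rng) for _ in range(20)]
labels = [[h(x) for x in attrs] for h in hashes]
print(labels[0])
```

Nearby attribute vectors receive the same label in many iterations and distant ones in few, so the fraction of iterations with matching labels acts as a similarity measure. A larger interval length makes collisions more likely and the resulting labels coarser.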
For the graph invariant kernel we used the LWL variant, which has been reported to perform overall well on all data sets [Ors+2015]. The implementation is using parallelization to speed up the computation and we set the number of parallel processes to eight. For GH and GI we used the Gaussian RBF kernel to compare node attributes. For all the data sets except Frankenstein, we set the parameter of the RBF kernel to 1/(Dimension of attribute vector), see [Fer+2013, Ors+2015]. For Frankenstein, we set [Ors+2015].
Moreover, in order to study the influence of the number of iterations of the hash graph kernels, we computed classification accuracies and running times of hash graph kernels with 1 to 50 iterations on the Enzymes data set. Running times were averaged over ten independent runs.
All experiments were conducted on a workstation with an Intel Core CPU and 16 GB of RAM running Ubuntu 14.04 LTS with Python 2.7.6 and Matlab R2015b.
V-C Results and Discussion
Table II: Classification accuracies in percent. Paired entries give the accuracy without / with taking discrete labels into account; WL and SP use discrete labels only. OOM: out of memory.
|Graph Kernel|Enzymes|Frankenstein|Proteins|SyntheticNew|Synthie|
|WL|53.97|73.53|75.02|98.57|53.60|
|SP|42.88|69.51|75.71|83.30|53.78|
|HGK-SP|66.73 / 71.30|65.84 / 70.06|75.14 / 77.47|80.55 / 96.46|86.27 / 94.34|
|HGK-WL|63.94 / 67.63|73.16 / 73.62|74.88 / 76.70|97.57 / 98.84|80.25 / 96.75|
|GH|– / 68.80|– / 68.48|– / 72.26|– / 85.10|– / 73.18|
|GI|– / 71.70|– / 76.31|– / 76.88|– / 83.07|– / 95.75|
|P2K|– / 69.22|– / OOM|– / 73.45|– / 91.70|– / 50.15|
In the following we answer questions Q1–Q3.
In terms of classification accuracies HGK-WL achieves state-of-the-art results on the Proteins and the SyntheticNew data sets. Notice that the WL kernel, without using attribute information, achieves almost the same classification accuracy on SyntheticNew. This indicates that on this data set the attributes are only of marginal relevance for classification. A different result is observed for the other data sets. On the Synthie data set HGK-WL achieves the overall best accuracy and is more than 20 percentage points better than GH and more than 40 percentage points better than P2K. The kernel almost achieves state-of-the-art classification accuracy on the Frankenstein data set. Notice that the parameter of the RBF kernel used in GI and GH was finely tuned.
HGK-SP achieves state-of-the-art classification accuracy on the Enzymes and Proteins data sets and compares favorably on the SyntheticNew data set. On the Frankenstein data set, we observed better classification accuracy than GH. Moreover, it also performs well on the Synthie data set.
In terms of running times both instances of the hash graph kernel framework perform very well. On all data sets HGK-WL obtains running times that are several orders of magnitude faster than GH and GI.
As Table II shows, the choice of the discrete base kernel has major influence on the classification accuracy for some data sets. On the Enzymes data set HGK-SP performs very favorably, while HGK-WL achieves higher classification accuracies on the Frankenstein data set. On the Proteins and the SyntheticNew data sets both hash graph kernels achieve similar results. HGK-WL performs slightly better on the Synthie data set.
Fig. 1 illustrates the influence of the number of iterations on HGK-SP and HGK-WL on the Enzymes data set. Both plots show that a higher number of iterations leads to better classification accuracies while the running time grows linearly. In the case of HGK-SP, the classification accuracy on the Enzymes data set improves by more than 12% when using 20 iterations instead of 1. The improvement on the Enzymes data set is even more substantial for HGK-WL: the classification accuracy improves by more than 16%. At about 30 and 40 iterations for HGK-SP and HGK-WL, respectively, the algorithms reach a point of saturation.
VI Conclusion and Future Work
We have introduced the hash graph kernel framework which allows applying the various existing scalable and well-engineered kernels for graphs with discrete labels to graphs with continuous attributes. The derived kernels outperform other kernels tailored to attributed graphs in terms of running time without sacrificing classification accuracy.
Moreover, we showed that the hash graph kernel framework approximates implicit variants of the shortest-path and the Weisfeiler-Lehman subtree kernel with an arbitrarily small error.
This work was supported by the German Science Foundation (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Data Analysis”, project A6 “Resource-efficient Graph Mining”. We thank Aasa Feragen, Marion Neumann, and Francesco Orsini for providing us with data sets and source code.