A Unifying View of Explicit and Implicit Feature Maps for Structured Data: Systematic Studies of Graph Kernels

03/02/2017 · Nils M. Kriege et al. · TU Dortmund · Washington University in St. Louis

Non-linear kernel methods can be approximated by fast linear ones using suitable explicit feature maps, which allows their application to large-scale problems. To this end, explicit feature maps of kernels for vectorial data have been extensively studied. Since much real-world data is structured, various kernels for complex data like graphs have been proposed. Indeed, many of them directly compute feature maps. However, the kernel trick is employed when the number of features is very large or the individual vertices of graphs are annotated by real-valued attributes. Can we still compute explicit feature maps efficiently under these circumstances? Triggered by this question, we investigate how general convolution kernels are composed from base kernels and construct corresponding feature maps. We apply our results to widely used graph kernels and analyze for which kernels and graph properties computation by explicit feature maps is feasible and actually more efficient. In particular, we derive feature maps for random walk and subgraph matching kernels and apply them to real-world graphs with discrete labels. Thereby, our theoretical results are confirmed experimentally by observing a phase transition when comparing running time with respect to label diversity, walk lengths and subgraph size, respectively. Moreover, we derive approximative, explicit feature maps for state-of-the-art kernels supporting real-valued attributes, including the GraphHopper and Graph Invariant kernels. In extensive experiments we show that our approaches often achieve a classification accuracy close to the exact methods based on the kernel trick, but require only a fraction of their running time.

1 Introduction

Analyzing complex data is becoming more and more important. In numerous application domains, e.g., chem- and bioinformatics, neuroscience, or image and social network analysis, the data is structured and hence can naturally be represented as graphs. To achieve successful learning we need to exploit the rich information inherent in the graph structure and the annotations of vertices and edges. A popular approach to mining structured data is to design graph kernels measuring the similarity between graphs. The graph kernel can then be plugged into a kernel machine, such as a support vector machine or a Gaussian process, for efficient learning and prediction.

The kernel-based approach to predictive graph mining requires a positive semidefinite (p.s.d.) kernel function between graphs. Graphs, composed of labeled vertices and edges possibly enriched with continuous attributes, however, are not fixed-length vectors but rather complicated data structures, and thus standard kernels cannot be used. Instead, the general strategy to design graph kernels is to decompose graphs into small substructures among which kernels are defined following the concept of convolution kernels due to Haussler (1999). The graph kernel itself is then a combination of the kernels between the possibly overlapping parts. Hence the various graph kernels proposed in the literature mainly differ in the way the parts are constructed and in the similarity measure used to compare them. Most of them can be seen as instances of convolution kernels (Vishwanathan et al., 2010). Moreover, existing graph kernels also differ in their ability to exploit annotations, which may be categorical labels or real-valued attributes on the vertices and edges.

We recall basic facts about kernels, which have decisive implications on several computational aspects. A kernel on a non-empty set $\mathcal{X}$ is a positive semidefinite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Equivalently, a function $k$ is a kernel if there is a feature map $\phi : \mathcal{X} \to \mathcal{H}$ to a real Hilbert space $\mathcal{H}$ with inner product $\langle \cdot, \cdot \rangle$, such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$ for all $x$ and $y$ in $\mathcal{X}$. This equivalence yields two algorithmic strategies to compute kernels on graphs:

  1. One way is functional computation, e.g., from closed-form expressions. In this case the feature map is not necessarily known and the feature space may be of infinite dimension. Therefore, we refer to this approach, which is closely related to the famous kernel trick, as implicit computation.

  2. The other strategy is to compute the feature map for each graph explicitly to obtain the kernel values from the dot product between pairs of feature vectors. These feature vectors commonly count how often certain substructures occur in a graph.

Linear kernel methods based on explicit feature maps can often be implemented efficiently and have therefore recently been preferred in practice over their kernelized counterparts. When feature maps are computed explicitly, the structured data is essentially transformed into vectorial data in a preprocessing step. Unique advantages of the implicit computation, on the other hand, are that

  1. kernels for composed objects can be combined of established kernels on their parts exploiting well-known closure properties of kernels;

  2. the number of possible features may be high, in theory infinite, while the kernel function remains polynomial-time computable.

Previously proposed graph kernels that are computed implicitly exploit at least one of the above mentioned advantages. Graph kernels computed by explicit feature maps do not allow one to specify base kernels for parts, such as continuous vertex annotations, but they scale to large graphs and data sets. We review concrete graph kernels with respect to this difference in Section 2 and proceed by summarizing our contribution.

1.1 Our Contribution

So far, previous work introducing novel graph kernels followed one of the strategies for computation. In contrast, we are interested in analyzing and comparing the computation schemes. We study under which conditions the computation of an explicit mapping from graphs to finite-dimensional feature spaces is feasible. To achieve our goal, we review closure properties of kernels and the corresponding feature maps with a focus on the size and sparsity of the feature vectors. Building on this we derive explicit feature maps for convolution kernels and assess how the properties of the graph in combination with the properties of the base kernel affect the running time. We theoretically analyze both methods of computation and identify a trade-off between running time and flexibility. We apply these results to obtain new algorithms for explicit graph kernel computation. We introduce the class of weighted vertex kernels and show that it generalizes state-of-the-art kernels for graphs with continuous attributes, namely the GraphHopper kernel (Feragen et al., 2013) and an instance of the Graph Invariant kernels (Orsini et al., 2015). We derive approximative, explicit feature maps for both based on approximate feature maps for the base kernels.

Then, we derive explicit computation schemes for random walk kernels (Gärtner et al., 2003; Vishwanathan et al., 2010), subgraph matching kernels (Kriege and Mutzel, 2012), and shortest-path kernels (Borgwardt and Kriegel, 2005). We compare efficient algorithms for the explicit and implicit computation of these kernels experimentally. Our product graph based computation of the walk kernel fully supports arbitrary vertex and edge kernels and exploits their sparsity. Further, we present the first explicit computation scheme for walk-based kernels. Given this, we are finally able to experimentally compare the running times of both computation strategies systematically with respect to the label diversity, data set size, and substructure size, i.e., walk length and subgraph size. As it turns out, there exists a computational phase transition for walk and subgraph kernels. Our experimental results for weighted vertex kernels show that their computation by explicit feature maps is feasible and provides a viable alternative even when comparing graphs with continuous attributes.

1.1.1 Extension of the Conference Paper

The present paper is a significant extension of a previously published conference paper (Kriege et al., 2014). In the following we list the main contributions that were not included in the conference version.

Feature maps of composed kernels.

We review closure properties of kernels, the corresponding feature maps and the size and sparsity of the feature vectors. Based on this, we obtain explicit feature maps for convolution kernels with arbitrary base kernels. This generalizes the result of the conference paper, where only binary base kernels were considered.

Weighted vertex kernels.

We introduce weighted vertex kernels generalizing two kernels for attributed graphs.

Application of explicit feature maps.

We derive explicit feature maps for weighted vertex kernels and the shortest-path kernel (Borgwardt and Kriegel, 2005) supporting arbitrary base kernels for the comparison of attributes.

Experimental evaluation.

We largely extended our evaluation, which now includes experiments for the novel computation schemes of graph kernels as well as a comparison between a graphlet kernel and the subgraph matching kernel (Kriege and Mutzel, 2012).

1.2 Outline

The article is organized as follows. In Section 2 we discuss related work and proceed by fixing the notation in Section 3. In Section 4, we discuss the explicit and implicit computation of kernels. Section 5 reviews closure properties of kernels and the corresponding feature maps. Moreover, we derive feature maps for the convolution kernel. Although the analysis and results presented in Sections 4 and 5 are valid for kernels in general, we give concrete examples arising in the domain of graph data. In Section 6, we derive feature maps for the shortest-path graph kernel, discuss the graphlet and the subgraph matching kernel, introduce the weighted vertex kernel and derive approximate feature maps. Moreover, we derive feature maps for the fixed length walk kernel and discuss different computation schemes. Section 7 presents the results of our experimental evaluation.

2 Related Work

In the following we review existing graph kernels based on explicit or implicit computation. For random walk kernels, implicit computation schemes based on product graphs have been proposed. The product graph has a vertex for each pair of vertices in the original graphs. Two vertices in the product graph are neighbors if the corresponding vertices in the original graphs were both neighbors as well. Product graphs have some nice properties making them suitable for the computation of graph kernels. First, the adjacency matrix $A_\times$ of a product graph is the Kronecker product of the adjacency matrices $A_1$ and $A_2$ of the original graphs, i.e., $A_\times = A_1 \otimes A_2$; the same holds for the weight matrix when employing an edge kernel. Further, there is a one-to-one correspondence between walks on the product graph and simultaneous walks on the original graphs, cf. (Gärtner et al., 2003). The random walk kernel introduced by Vishwanathan et al. (2010) is now given by

  $K(G_1, G_2) = \sum_{\ell=0}^{\infty} \mu_\ell \, q_\times^\top A_\times^\ell \, p_\times$,

where $p_\times$ and $q_\times$ are starting and stopping probability distributions and $\mu_\ell$ are coefficients such that the sum converges. Several variations of the random walk kernel have been introduced in the literature. Instead of considering weights or probabilities, the geometric random walk kernel introduced by Gärtner et al. (2003) counts the number of matching walks. Other variants of random walk kernels have been proposed, cf. (Kashima et al., 2003; Mahé et al., 2004; Borgwardt et al., 2005; Harchaoui and Bach, 2007; Kang et al., 2012). See also (Sugiyama and Borgwardt, 2015) for some recent theoretical results on random walk kernels. Another substructure used to measure the similarity among graphs are shortest paths. Borgwardt and Kriegel (2005) proposed the shortest-path kernel, which compares two graphs based on vertex pairs with similar shortest-path lengths. The GraphHopper kernel compares the vertices encountered while hopping along shortest paths (Feragen et al., 2013). The above mentioned approaches support graphs with continuous attributes; further kernels for this setting exist (Orsini et al., 2015; Su et al., 2016). Also computed via a product graph, the subgraph matching kernel compares subgraphs of small size and rates mappings between them according to vertex and edge kernels (Kriege and Mutzel, 2012).

Avoiding the construction of potentially huge product graphs, explicit feature maps for graph kernels can often be computed more memory-efficiently and also faster. The features are typically counts or indicators of occurrences of substructures of particular sizes. The graphlet kernel, for example, counts induced subgraphs of size $k$ of unlabeled graphs according to $K(G, H) = \langle f_G, f_H \rangle$, where $f_G$ and $f_H$ are the count features of $G$ and $H$, respectively, cf. (Shervashidze et al., 2009). The cyclic pattern kernel measures the occurrence of cyclic and tree patterns and maps the graphs to pattern indicator features which are independent of the pattern frequency, cf. (Horváth et al., 2004). The Weisfeiler-Lehman subtree kernel counts label-based subtree-patterns, cf. (Shervashidze et al., 2011), according to $K(G, H) = \langle \phi(G), \phi(H) \rangle$, where $\phi(G)$ is a feature vector counting the subtree-patterns in $G$ up to depth $h$. A subtree-pattern is a tree rooted at a particular vertex where each level contains the neighbors of its parent vertex; the same vertices can appear repeatedly. Other graph kernels on subtree-patterns have been proposed in the literature (Ramon and Gärtner, 2003; Harchaoui and Bach, 2007; Bai et al., 2015; Hido and Kashima, 2009). In a similar spirit, the propagation kernel iteratively counts similar label or attribute distributions to create an explicit feature map for efficient kernel computation (Neumann et al., 2016).

The large number of recently introduced graph kernels indicates that machine learning on structured data is both considerably difficult and important. Surprisingly, none of the above kernels is flexible enough to consider any kind of vertex and edge information while still being fast and memory efficient across arbitrary graph databases. The following observation is crucial: graph kernels supporting complex annotations typically use implicit computation schemes and do not scale well, whereas graphs with discrete labels are efficiently compared by graph kernels based on explicit feature maps. Recently the hash graph kernel framework (Morris et al., 2016) has been proposed to obtain efficient kernels for graphs with continuous attributes from those proposed for discrete ones. The idea is to iteratively turn continuous attributes into discrete labels using randomized hash functions. A drawback of the approach is that a so-called independent hash family for the attribute kernel must be known to guarantee that the approach approximates attribute comparisons by that kernel. In practice locality-sensitive hashing is used, which does not provide this guarantee, but still achieves promising results. Apart from this approach no results on explicit feature maps for kernels on graphs with continuous attributes are known. However, explicit feature maps of kernels for vectorial data have been studied extensively. Starting with the seminal work by Rahimi and Recht (2008), explicit feature maps of various popular kernels have been proposed, cf. (Vedaldi and Zisserman, 2012; Kar and Karnick, 2012; Pham and Pagh, 2013, and references therein). We build on this line of work to obtain kernels for graphs where individual vertices and edges are annotated by vectorial data. In contrast to the hash graph kernel framework, our goal is to lift the known approximation results for kernels on vectorial data to kernels for graphs annotated with vector data.

3 Preliminaries

An (undirected) graph $G$ is a pair $(V, E)$ with a finite set of vertices $V$ and a set of edges $E \subseteq \{\{u, v\} : u, v \in V\}$. We denote the set of vertices and the set of edges of $G$ by $V(G)$ and $E(G)$, respectively. For ease of notation we denote the edge $\{u, v\}$ in $E(G)$ by $uv$ or $vu$. A graph $G'$ is a subgraph of a graph $G$ if $V(G') \subseteq V(G)$ and $E(G') \subseteq E(G)$. The subgraph $G'$ is said to be induced if $uv \in E(G')$ whenever $u, v \in V(G')$ and $uv \in E(G)$, and we write $G' \sqsubseteq G$. We denote the neighborhood of a vertex $v$ in $V(G)$ by $N(v) = \{u \in V(G) : vu \in E(G)\}$.

A labeled graph is a graph $G$ endowed with a label function $\mu : V(G) \to \Sigma$, where $\Sigma$ is a finite alphabet. We say that $\mu(v)$ is the label of $v$ for $v$ in $V(G)$. An attributed graph is a graph $G$ endowed with a function $a : V(G) \to \mathbb{R}^d$, $d \in \mathbb{N}$, and we say that $a(v)$ is the attribute of $v$. We denote the base kernel for comparing vertex labels and attributes by $k_V$ and, for short, write $k_V(u, v)$ instead of $k_V(\mu(u), \mu(v))$ or $k_V(a(u), a(v))$. The above definitions directly extend to graphs where edges have labels or attributes, and we denote the corresponding base kernel by $k_E$. We refer to $k_V$ and $k_E$ as vertex kernel and edge kernel, respectively.

For a vector $x$ in $\mathbb{R}^n$, we denote by $\operatorname{supp}(x)$ the set of indices of the non-zero components of $x$ and let $\|x\|_0 = |\operatorname{supp}(x)|$ be the number of non-zero components.

4 Kernel Methods and Kernel Computation

Kernel methods based on implicit kernel computation are often slower than linear methods based on explicit feature maps, assuming the feature vectors are of manageable size. This is, for example, the case for support vector machines, which classify objects according to their location w.r.t. a hyperplane. When computing feature maps explicitly, the normal vector of the hyperplane can be constructed explicitly as well and classification requires computing a single dot product only. The running time for this essentially depends on the number of (non-zero) components of the feature vectors. Using implicit computation, the number of kernel computations required depends on the number of support vectors defining the hyperplane. Moreover, the running time for training a support vector machine on explicit feature vectors is linear in the number of training examples assuming a constant number of non-zero components (Joachims, 2006). This example illustrates that implicit and explicit kernel computation have a significant effect on the running time of the kernel method at the higher level. In order to compare the running time of both strategies systematically without dependence on one specific kernel method, we study the running time to compute a kernel matrix, which stores the kernel values for all pairs of data objects.

4.1 Computing Kernel Matrices

Algorithm 1 generates the kernel matrix in a straightforward manner by directly computing the kernel functions, thus applying a mapping into feature space implicitly.

Input : A set of graphs $\{G_1, \dots, G_n\}$.
Output : Symmetric kernel matrix $K$ with entries $K_{ij} = k(G_i, G_j)$.
for $i = 1$ to $n$ do
      for $j = i$ to $n$ do
            $K_{ij} = K_{ji} =$ ComputeKernel($G_i$, $G_j$)
return $K$
Algorithm 1 Computation by implicit mapping into feature space.

Here, we assume that the procedure ComputeKernel does not internally generate the feature vectors of the two graphs passed as parameters in order to compute the kernel function. While this would in principle be possible, it would involve computing the feature vector of every graph $n$ times. When explicit mapping is applied, the feature vectors can instead be generated once for each graph of the data set. Then the matrix is computed by taking the dot product between these feature vectors, cf. Algorithm 2. This approach is equivalent to computing the matrix product $K = \Phi \Phi^\top$, where $\Phi$ is the matrix obtained by row-wise concatenation of the feature vectors.

Input : A set of graphs $\{G_1, \dots, G_n\}$.
Data : Feature vectors $\phi(G_i)$ for $i \in \{1, \dots, n\}$.
Output : Symmetric kernel matrix $K$ with entries $K_{ij} = k(G_i, G_j)$.
for $i = 1$ to $n$ do
      Compute $\phi(G_i) =$ FeatureMap($G_i$)
for $i = 1$ to $n$ do
      for $j = i$ to $n$ do
            $K_{ij} = K_{ji} =$ DotProduct($\phi(G_i)$, $\phi(G_j)$)
return $K$
Algorithm 2 Computation by explicit mapping into feature space.

Both approaches differ in terms of running time, which depends on the complexity of the individual procedures that must be computed in the course of the algorithms.

Algorithm 1 computes an $n \times n$ kernel matrix in time $O(n^2 \cdot T_k)$, where $T_k$ is the running time of ComputeKernel to compute a single kernel value.

Algorithm 2 computes an $n \times n$ kernel matrix in time $O(n \cdot T_\phi + n^2 \cdot T_{\mathrm{dot}})$, where $T_\phi$ is the running time of FeatureMap to compute the feature vector of a single graph and $T_{\mathrm{dot}}$ the running time of DotProduct for computing the dot product between two feature vectors.

Clearly, explicit computation can only be competitive with implicit computation when $T_{\mathrm{dot}}$ is smaller than $T_k$. In this case, however, even a time-consuming feature mapping pays off with increasing data set size. The running time $T_{\mathrm{dot}}$, thus, is crucial for explicit computation and depends on the data structure used to store feature vectors.
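To make the two strategies concrete, the following minimal Python sketch (not taken from the paper) computes a kernel matrix both ways for a toy kernel: graphs are represented as multisets of vertex labels and compared by the cross product of Dirac kernels on the labels, so both algorithms produce the same matrix.

from collections import Counter

def compute_kernel(g1, g2):                      # implicit: evaluate one kernel value
    return sum(1 for a in g1 for b in g2 if a == b)

def feature_map(g):                              # explicit: sparse label-count vector
    return Counter(g)

def dot_product(x, y):
    return sum(c * y.get(f, 0) for f, c in x.items())

def gram_implicit(graphs):                       # Algorithm 1: O(n^2) kernel evaluations
    n = len(graphs)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = compute_kernel(graphs[i], graphs[j])
    return K

def gram_explicit(graphs):                       # Algorithm 2: n feature maps, O(n^2) dot products
    phi = [feature_map(g) for g in graphs]
    n = len(graphs)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = dot_product(phi[i], phi[j])
    return K

graphs = [["C", "C", "O"], ["C", "N"], ["O", "O", "C"]]
assert gram_implicit(graphs) == gram_explicit(graphs)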

4.2 Storing Feature Vectors

A common approach to define a feature map is to assign each possible feature to one dimension of the feature space. Then the feature vector of an object is obtained by counting the occurrences of all features in the object. Such feature vectors are typically sparse and many of the theoretically possible features do not occur at all in a specific data set. When considering the label sequences of walks in a molecular graph, for example, a label sequence H=H would correspond to a hydrogen atom with a double bond to another hydrogen atom. This does not occur in valid chemical structures, but cannot be excluded in advance without domain knowledge. This is exploited by sparse data structures for vectors and matrices, whose running times depend on the number of non-zero components instead of the total number of components. One approach to realize a sparse data structure for feature vectors is to employ a hash table. In this case the dot product of two feature vectors $x$ and $y$ can be computed in time $O(\min\{\|x\|_0, \|y\|_0\})$ in the average case.
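As an illustration of such a hash-table representation (a sketch under our own assumptions, not code from the paper), feature vectors can be stored as Python dictionaries keyed by arbitrary hashable features, and the dot product iterates over the vector with fewer non-zero components:

def sparse_dot(x: dict, y: dict) -> float:
    # average-case cost proportional to min(nnz(x), nnz(y)), independent of the dimension
    if len(x) > len(y):
        x, y = y, x                              # iterate over the sparser vector
    return sum(c * y.get(f, 0.0) for f, c in x.items())

# feature indices may be arbitrary hashable objects, e.g. label sequences of walks
x = {("C", "-", "C"): 4, ("C", "=", "O"): 1}
y = {("C", "-", "C"): 2, ("C", "-", "N"): 3}
print(sparse_dot(x, y))                          # 8.0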

5 Basic Kernels, Composed Kernels and Their Feature Maps

Graph kernels, in particular those supporting user-specified kernels for annotations, typically employ closure properties. This allows graphs to be decomposed into parts, which ultimately are the annotated vertices and edges. The graph kernel then is composed of base kernels applied to the annotations and annotated substructures, respectively.

We first consider the feature maps of basic kernels and then review closure properties of kernels and discuss how to obtain their feature maps. Some of the basic results on the construction of feature maps and their detailed proofs can be found in the textbook by Shawe-Taylor and Cristianini (2004). Going beyond that, we discuss the sparsity of the obtained feature vectors in detail, which has an essential impact on efficiency in practice. The results are summarized in Table 1. As we will see in Section 5.1, a large number of components of feature vectors may be, and in practice often is, zero. This is exploited by sparse data structures for vectors and matrices, cf. Section 4.2. Indeed, the running times we observe experimentally in Section 7 can only be explained by taking the sparsity into account.

5.1 Dirac and Binary Kernels

We discuss feature maps for basic kernels often used for the construction of kernels on structured objects. The Dirac kernel $k_\delta$ on a set $\mathcal{X}$ is defined by $k_\delta(x, y) = 1$ if $x = y$ and $0$ otherwise. It is well-known that $\phi : \mathcal{X} \to \mathbb{R}^{|\mathcal{X}|}$ with components indexed by $z \in \mathcal{X}$ and defined as $\phi(x)_z = 1$ if $x = z$, and $\phi(x)_z = 0$ otherwise, is a feature map of the Dirac kernel.

More generally, we say a kernel $k$ on $\mathcal{X}$ is binary if $k(x, y)$ is either $0$ or $1$ for all $x, y \in \mathcal{X}$. Given a binary kernel $k$, we refer to $\sim_k \,= \{(x, y) \in \mathcal{X} \times \mathcal{X} : k(x, y) = 1\}$ as the relation on $\mathcal{X}$ induced by $k$. Next we will establish several properties of this relation, which will turn out to be useful for the construction of a feature map.

Let $k$ be a binary kernel on $\mathcal{X}$; then $k(x, y) \leq \min\{k(x, x), k(y, y)\}$ holds for all $x, y \in \mathcal{X}$. Assume there are $x, y \in \mathcal{X}$ such that $k(x, y) = 1$ and $k(x, x) = 0$. By the definition of a binary kernel we obtain $k(y, y) = 0$ or $k(y, y) = 1$. The symmetric kernel matrix obtained by restricting $k$ to $\{x, y\}$ thus is either $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ or $\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$, where we assume that the first row and column is associated with $x$. Both matrices are not p.s.d. and, thus, $k$ is not a kernel, contradicting the assumption.

Let $k$ be a binary kernel on $\mathcal{X}$; then $\sim_k$ is a partial equivalence relation, meaning that the relation is (i) symmetric and (ii) transitive. Property (i) follows from the fact that $k$ must be symmetric according to the definition. Assume property (ii) does not hold. Then there are $x, y, z \in \mathcal{X}$ with $x \sim_k y$, $y \sim_k z$ and $k(x, z) = 0$. Since $k(x, x) = k(y, y) = k(z, z) = 1$ must hold according to Lemma 5.1, we can conclude that $x$, $y$ and $z$ are pairwise distinct. We consider the kernel matrix $K$ obtained by restricting $k$ to $\{x, y, z\}$ and assume that the first, second and third row as well as column is associated with $x$, $y$ and $z$, respectively. There must be entries $K_{12} = K_{23} = 1$ and $K_{13} = 0$. According to Lemma 5.1 the entries of the main diagonal are all $1$. Consider the coefficient vector $c = (1, -1, 1)^\top$; we obtain $c^\top K c = -1 < 0$. Hence, $K$ is not p.s.d. and $k$ is not a kernel, contradicting the assumption.

We use these results to construct a feature map for a binary kernel. We restrict our consideration to the set $\mathcal{X}_k = \{x \in \mathcal{X} : k(x, x) = 1\}$, on which $\sim_k$ is an equivalence relation. The quotient set $Q = \mathcal{X}_k / \!\sim_k$ is the set of equivalence classes induced by $\sim_k$. Let $[x]$ denote the equivalence class of $x$ under the relation $\sim_k$. Let $k_\delta$ be the Dirac kernel on the equivalence classes $Q$; then $k(x, y) = k_\delta([x], [y])$ for all $x, y \in \mathcal{X}_k$ and we obtain the following result. Let $k$ be a binary kernel with quotient set $Q$; then $\phi : \mathcal{X} \to \mathbb{R}^{|Q|}$ with $\phi(x)_C = 1$ if $x \in C$, and $\phi(x)_C = 0$ otherwise, is a feature map of $k$.
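A small Python sketch of this construction (illustrative only; the helper canon is a hypothetical function mapping each object to a canonical representative of its equivalence class, so that k(x, y) = 1 iff canon(x) == canon(y)):

def binary_kernel_feature_map(objects, canon):
    classes = sorted({canon(x) for x in objects})     # equivalence classes observed in the data
    index = {c: i for i, c in enumerate(classes)}
    def phi(x):
        vec = [0] * len(classes)
        vec[index[canon(x)]] = 1                      # one-hot indicator of the class [x]
        return vec
    return phi

# example: strings compared up to letter case, i.e., k(x, y) = 1 iff x.lower() == y.lower()
phi = binary_kernel_feature_map(["Ab", "aB", "cd"], canon=str.lower)
print(phi("AB"), phi("cd"))                           # [1, 0] [0, 1]

Restricting the components to the equivalence classes that actually occur in the data set is sufficient for computing a kernel matrix and keeps the vectors short.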

Kernel               Feature Map                      Dimension         Sparsity
$\alpha \cdot k_1$   $\sqrt{\alpha}\,\phi_1(x)$       $d_1$             $\|\phi_1(x)\|_0$
$k_1 + k_2$          $\phi_1(x) \oplus \phi_2(x)$     $d_1 + d_2$       $\|\phi_1(x)\|_0 + \|\phi_2(x)\|_0$
$k_1 \cdot k_2$      $\phi_1(x) \otimes \phi_2(x)$    $d_1 \cdot d_2$   $\|\phi_1(x)\|_0 \cdot \|\phi_2(x)\|_0$
$k_\times(U, V)$     $\sum_{u \in U} \phi_1(u)$       $d_1$             $\leq \sum_{u \in U} \|\phi_1(u)\|_0$
Table 1: Composed kernels, their feature map, dimension and sparsity. We assume $k_1$ and $k_2$ to be kernels with feature maps $\phi_1$ and $\phi_2$ of dimension $d_1$ and $d_2$, respectively.

5.2 Closure Properties

For a kernel $k$ on a non-empty set $\mathcal{X}$ the function $\alpha \cdot k$ with $\alpha \geq 0$ in $\mathbb{R}$ is again a kernel on $\mathcal{X}$. Let $\phi$ be a feature map of $k$; then $\sqrt{\alpha} \cdot \phi$ is a feature map of $\alpha \cdot k$. For addition and multiplication, we get the following result.

[Shawe-Taylor and Cristianini 2004, pp. 75 sqq.] Let $k_i$ for $i \in \{1, 2\}$ be kernels on $\mathcal{X}$ with feature maps $\phi_i$ of dimension $d_i$, respectively. Then

  $k_+(x, y) = k_1(x, y) + k_2(x, y)$ and $k_\bullet(x, y) = k_1(x, y) \cdot k_2(x, y)$

are again kernels on $\mathcal{X}$. Moreover,

  $\phi_+(x) = \phi_1(x) \oplus \phi_2(x)$ and $\phi_\bullet(x) = \phi_1(x) \otimes \phi_2(x)$

are feature maps for $k_+$ and $k_\bullet$ of dimension $d_1 + d_2$ and $d_1 \cdot d_2$, respectively. Here $\oplus$ denotes the concatenation of vectors and $\otimes$ the Kronecker product.

In case of $k_1 = k_2$, we have $k_+ = 2 \cdot k_1$ and a $d_1$-dimensional feature map can be obtained. For the product, in contrast, we have $k_\bullet = k_1^2$ with feature map $\phi_1 \otimes \phi_1$, which does not allow for a feature space of dimension $d_1$ in general.

We state an immediate consequence of Proposition 5.2 regarding the sparsity of the obtained feature vectors explicitly.

Let $\phi_+$ and $\phi_\bullet$ be defined as above; then $\|\phi_+(x)\|_0 = \|\phi_1(x)\|_0 + \|\phi_2(x)\|_0$ and $\|\phi_\bullet(x)\|_0 = \|\phi_1(x)\|_0 \cdot \|\phi_2(x)\|_0$.
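The following NumPy sketch (dense vectors for clarity, not from the paper) illustrates Proposition 5.2 and the sparsity statement: concatenation realizes the sum kernel and the Kronecker product realizes the product kernel.

import numpy as np

phi1 = np.array([0.0, 2.0, 0.0, 1.0])               # ||phi1(x)||_0 = 2, dimension d1 = 4
phi2 = np.array([3.0, 0.0, 1.0])                     # ||phi2(x)||_0 = 2, dimension d2 = 3

phi_sum = np.concatenate([phi1, phi2])               # feature map of k1 + k2, dimension d1 + d2
phi_prod = np.kron(phi1, phi2)                       # feature map of k1 * k2, dimension d1 * d2

nnz = lambda v: int(np.count_nonzero(v))
assert nnz(phi_sum) == nnz(phi1) + nnz(phi2)         # additive sparsity
assert nnz(phi_prod) == nnz(phi1) * nnz(phi2)        # multiplicative sparsity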

5.3 Kernels on Sets

In the following we derive an explicit mapping for kernels on finite sets. This result will be needed in the succeeding section for constructing an explicit feature map for the $R$-convolution kernel. Let $k$ be a base kernel on a set $\mathcal{X}$, and let $U$ and $V$ be finite subsets of $\mathcal{X}$. Then the cross product kernel or derived subset kernel on finite subsets of $\mathcal{X}$ is defined as

  $k_\times(U, V) = \sum_{u \in U} \sum_{v \in V} k(u, v)$.    (1)

Let $\phi$ be a feature map of $k$; then the function

  $\phi_\times(U) = \sum_{u \in U} \phi(u)$    (2)

is a feature map of the cross product kernel (Shawe-Taylor and Cristianini, 2004, Proposition 9.42). In particular, the feature space of the cross product kernel corresponds to the feature space of the base kernel; both have the same dimension. For the Dirac kernel as base kernel, $\phi_\times$ maps the set $U$ to its characteristic vector, which has $|\mathcal{X}|$ components and $|U|$ non-zero elements. When $k$ is a binary kernel as discussed in Section 5.1, the number of components reduces to the number of equivalence classes of $\sim_k$ and the number of non-zero elements becomes the number of cells of the quotient set that contain elements of $U$. In general, we obtain the following result as an immediate consequence of Equation (2). Let $\phi_\times$ be the feature map of the cross product kernel and $\phi$ the feature map of its base kernel; then $\|\phi_\times(U)\|_0 \leq \sum_{u \in U} \|\phi(u)\|_0$.

A crucial observation is that the number of non-zero components of the feature vector $\phi_\times(U)$ depends on both the cardinality and structure of the set $U$ and the feature map $\phi$ acting on the elements of $U$. In the worst case the elements of $U$ are mapped by $\phi$ to feature vectors with pairwise distinct non-zero components, in which case the bound above is attained with equality.
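A minimal Python sketch of the cross product kernel and its feature map (Equations (1) and (2)), assuming a Dirac base kernel on discrete elements and a sparse dictionary representation:

from collections import Counter

def k_cross(U, V, k=lambda a, b: float(a == b)):
    return sum(k(u, v) for u in U for v in V)        # Equation (1)

def phi_cross(U):
    return Counter(U)                                # Equation (2): sum of one-hot vectors phi(u)

def sparse_dot(x, y):
    return sum(c * y.get(f, 0) for f, c in x.items())

U, V = {"a", "b"}, {"a", "b", "c"}
assert k_cross(U, V) == sparse_dot(phi_cross(U), phi_cross(V))   # both equal 2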

5.4 Convolution Kernels

Haussler (1999) proposed $R$-convolution kernels as a generic framework to define kernels between composite objects. In the following we derive feature maps for such kernels by using the results for basic operations introduced in the previous sections. Thereby, we generalize the result presented in (Kriege et al., 2014).

Suppose $x_1, \dots, x_d$ are the parts of $x \in \mathcal{X}$ according to some decomposition. Let $R$ be a relation such that $R(x_1, \dots, x_d, x)$ holds if and only if $x$ can be decomposed into the parts $x_1, \dots, x_d$. Let $R^{-1}(x) = \{(x_1, \dots, x_d) : R(x_1, \dots, x_d, x)\}$ and assume $R^{-1}(x)$ is finite for all $x \in \mathcal{X}$. The $R$-convolution kernel is

  $K(x, y) = \sum_{(x_1, \dots, x_d) \in R^{-1}(x)} \ \sum_{(y_1, \dots, y_d) \in R^{-1}(y)} \ \prod_{i=1}^{d} k_i(x_i, y_i)$,    (3)

where $k_i$ is a kernel on the parts of the $i$-th component for all $i \in \{1, \dots, d\}$.

Assume that we have explicit feature maps $\phi_i$ for the kernels $k_i$. We first note that a feature map for the product $\prod_{i=1}^{d} k_i$ on tuples of parts can be obtained from the feature maps $\phi_1, \dots, \phi_d$ by Proposition 5.2. (Note that we may consider every kernel $k_i$ on the $i$-th component as a kernel on whole tuples of parts by letting it act on the $i$-th component only.) In fact, Equation (3) for arbitrary $d$ can be obtained from the case $d = 1$ for an appropriate choice of $R$ and the base kernel, as noted by Shin and Kuboyama (2010). If we assume $d = 1$, the $R$-convolution kernel boils down to the cross product kernel and we have $K(x, y) = k_\times(R^{-1}(x), R^{-1}(y))$, where both employ the same base kernel $k_1$. We use this approach to develop explicit mapping schemes for graph kernels in the following. Let $\phi_\bullet = \phi_1 \otimes \dots \otimes \phi_d$ be a feature map for the product of the base kernels of dimension $d_\bullet = \prod_{i=1}^{d} d_i$; then from Equation (2), we obtain an explicit mapping of dimension $d_\bullet$ for the $R$-convolution kernel according to

  $\phi(x) = \sum_{(x_1, \dots, x_d) \in R^{-1}(x)} \phi_1(x_1) \otimes \dots \otimes \phi_d(x_d)$.    (4)

As discussed in Section 5.3, the sparsity of $\phi(x)$ simultaneously depends on the number of parts and their relation in the feature space of the base kernel.

Kriege et al. (2014) considered the special case that the base kernel is a binary kernel, cf. Section 5.1. From Proposition 5.1 and Equation (4) we directly obtain their result as a special case.

[Kriege et al. (2014), Theorem 3] Let $K$ be an $R$-convolution kernel with binary base kernel $k$ and quotient set $Q$; then $\phi : \mathcal{X} \to \mathbb{R}^{|Q|}$ with $\phi(x)_C = |\{\bar{x} \in R^{-1}(x) : \bar{x} \in C\}|$ for $C \in Q$ is a feature map of $K$.
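The theorem translates directly into a counting procedure. The sketch below (our own illustration, not the paper's code) represents each equivalence class by a canonical form returned by a hypothetical helper canon and counts the parts of each object per class; the kernel is then the dot product of the resulting sparse vectors.

from collections import Counter

def r_convolution_phi(parts, canon):
    # parts: the decomposition R^{-1}(x) of an object x
    return Counter(canon(p) for p in parts)

def kernel(phi_x, phi_y):
    return sum(c * phi_y.get(f, 0) for f, c in phi_x.items())

# toy decomposition: a label string decomposed into its 2-grams, compared up to
# reversal, i.e., the binary base kernel k(p, q) = 1 iff p == q or p equals q reversed
decompose = lambda s: [s[i:i + 2] for i in range(len(s) - 1)]
canon = lambda p: min(p, p[::-1])

x, y = "CNOC", "CONC"
print(kernel(r_convolution_phi(decompose(x), canon),
             r_convolution_phi(decompose(y), canon)))            # 3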

6 Application to Graph Kernels

We apply the results obtained in the previous section to graph kernels, which are a prominent example of kernels for structured data. A crucial observation of our study of feature maps for composed kernels is the following. The number of components of the feature vectors increases multiplicatively under taking products of kernels; this also holds in terms of non-zero components. Unless feature vectors have few non-zero components, this operation is likely to be prohibitive in practice. However, if feature vectors have exactly one non-zero component, like those associated with binary kernels, taking products of kernels is manageable by sparse data structures.

Indeed, this fact explains a recent observation in the development of graph kernels (Morris et al., 2016): Graphs with discrete labels, which can be adequately compared by the Dirac kernel, can be compared efficiently by graph kernels based on explicit feature maps, whereas graph kernels supporting complex annotations use implicit computation schemes and do not scale well. Typically graph kernels are proposed with one method of computation, either implicit or explicit.

We first discuss two families of kernels for which both computation schemes have been considered previously and put them in the context of our systematic study. We then derive explicit computation schemes of three kernels, for which methods of implicit computation have been proposed. We empirically study both computation schemes for graph kernels confirming our theoretical results experimentally in fine detail in Section 7.

6.1 Explicit Computation for Graphs with Discrete Labels

We review two kernels for which both methods of computation have been used previously. The shortest-path kernel was proposed with an implicit computation scheme, but explicit methods of computation have been reported to be used for graphs with discrete labels. Subgraph or graphlet kernels have been proposed for unlabeled graphs or graphs with discrete labels. The subgraph matching kernel has been developed as an extension for attributed graphs.

6.1.1 Shortest-path Kernel

A classical kernel applicable to attributed graphs is the shortest-path kernel (Borgwardt and Kriegel, 2005). This kernel compares all shortest paths in two graphs according to their lengths and the vertex annotations of their endpoints. The shortest-path kernel is defined as

  $K_{SP}(G, H) = \sum_{u, v \in V(G)} \ \sum_{u', v' \in V(H)} k_V(u, u') \cdot k_V(v, v') \cdot k_L(d_G(u, v), d_H(u', v'))$,    (5)

where $k_V$ is a kernel comparing vertex labels of the respective starting and end vertices of the paths. Here, $d_G(u, v)$ denotes the length of a shortest path from $u$ to $v$ and $k_L$ is a kernel comparing path lengths with $k_L(d_G(u, v), d_H(u', v')) = 0$ if $d_G(u, v) = \infty$ or $d_H(u', v') = \infty$.

Its computation is performed in two steps (Borgwardt and Kriegel, 2005): for each graph $G$ of the data set the complete graph on the vertex set $V(G)$ is generated, where an edge $uv$ is annotated with the length of a shortest path from $u$ to $v$. The shortest-path kernel then is equivalent to the walk kernel with fixed length $1$ between these transformed graphs, where the kernel essentially compares all pairs of edges. The kernel $k_L$ used to compare path lengths may, for example, be realized by the Brownian Bridge kernel (Borgwardt and Kriegel, 2005).

For the application to graphs with discrete labels a more efficient method of computation by explicit mapping has been reported by Shervashidze et al. (2011, Section 3.4.1). When $k_V$ and $k_L$ both are Dirac kernels, each component of the feature vector corresponds to a triple consisting of two vertex labels and a path length. This method of computation has been applied in several experimental comparisons, e.g., (Kriege and Mutzel, 2012; Morris et al., 2016). This feature map is directly obtained from our results in Section 5. It is likewise recovered from our explicit computation schemes for fixed length walk kernels reported in Section 6.3. However, we can also derive explicit feature maps for non-trivial kernels $k_V$ and $k_L$. Then the dimension of the feature map increases due to the product of kernels, cf. Equation 5. We will study this and the effect on running time experimentally in Section 7.
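A sketch of this explicit feature map, assuming graphs are given as NetworkX graphs with a 'label' attribute on every vertex (an assumption for illustration, not the authors' implementation): each feature is a triple of the two endpoint labels (in canonical order) and the shortest-path length.

from collections import Counter
from itertools import combinations
import networkx as nx

def sp_features(G, label="label"):
    dist = dict(nx.all_pairs_shortest_path_length(G))
    phi = Counter()
    for u, v in combinations(G.nodes, 2):
        if v in dist[u]:                                          # only finite path lengths
            a, b = sorted((G.nodes[u][label], G.nodes[v][label]))
            phi[(a, b, dist[u][v])] += 1
    return phi

def sp_kernel(G, H):
    x, y = sp_features(G), sp_features(H)
    return sum(c * y.get(f, 0) for f, c in x.items())

G = nx.path_graph(3)
nx.set_node_attributes(G, {0: "C", 1: "O", 2: "C"}, "label")
print(sp_features(G))    # Counter({('C', 'O', 1): 2, ('C', 'C', 2): 1})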

6.1.2 Graphlet, Subgraph and Subgraph Matching Kernels

Given two graphs $G$ and $H$ in $\mathcal{G}$, the subgraph kernel is defined as

  $k_{\subseteq}(G, H) = \sum_{G' \subseteq G} \ \sum_{H' \subseteq H} k_\simeq(G', H')$,    (6)

where $k_\simeq$ is the isomorphism kernel, i.e., $k_\simeq(G', H') = 1$ if and only if $G'$ and $H'$ are isomorphic, and $0$ otherwise. A similar kernel was defined by Gärtner et al. (2003) and its computation was shown to be NP-hard. However, it is polynomial-time computable when considering only subgraphs up to a fixed size. The subgraph kernel, cf. Equation 6, is easily identified as an instance of the cross product kernel, cf. Equation 1. The base kernel is not the trivial Dirac kernel, but binary, cf. Section 5.1. The equivalence classes induced by $k_\simeq$ are referred to as isomorphism classes and distinguish subgraphs up to isomorphism. The feature map of the subgraph kernel maps a graph $G$ to a vector, where each component counts the number of occurrences of a specific graph as a subgraph in $G$. Determining the isomorphism class of a graph is known as the graph canonization problem and is well-studied. By solving the graph canonization problem instead of the graph isomorphism problem we obtain an explicit feature map for the subgraph kernel. Although graph canonization clearly is at least as hard as graph isomorphism, the number of canonizations required is linear in the number of subgraphs, while a quadratic number of isomorphism tests would be required for a single naïve computation of the kernel. The gap in terms of running time even increases when computing a whole kernel matrix, cf. Section 4.1.

Indeed, the observations above are a key to several recently proposed graph kernels. The graphlet kernel (Shervashidze et al., 2009), also see Section 2, is an instance of the subgraph kernel and is computed by explicit feature maps. However, only unlabeled graphs of small size are considered by the graphlet kernel, such that the canonizing function can be computed easily. The same approach was taken by Wale et al. (2008) considering larger connected subgraphs of labeled graphs derived from chemical compounds. On the contrary, for attributed graphs with continuous vertex labels, the isomorphism kernel is not sufficient to compare subgraphs adequately. Therefore, subgraph matching kernels were proposed by Kriege and Mutzel (2012), which allow specifying arbitrary kernel functions to compare vertex and edge attributes. Essentially, this kernel considers all mappings between subgraphs and scores each mapping by the product of vertex and edge kernel values of the vertex and edge pairs involved in the mapping. When the specified vertex and edge kernels are Dirac kernels, the subgraph matching kernel is equal to the subgraph kernel up to a factor taking the number of automorphisms between subgraphs into account (Kriege and Mutzel, 2012). Based on the above observations, explicit mapping of subgraph matching kernels is likely to be more efficient when subgraphs can be adequately compared by a binary kernel.
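The following sketch illustrates the canonization idea for connected induced subgraphs on three vertices (a toy graphlet-style feature map for labeled graphs; the brute-force canonization over all vertex orderings is our own simplification and only feasible because the subgraphs are tiny).

from collections import Counter
from itertools import combinations, permutations
import networkx as nx

def canonical_form(G, nodes, label="label"):
    best = None
    for order in permutations(nodes):                             # brute-force canonization
        labels = tuple(G.nodes[v][label] for v in order)
        edges = tuple(int(G.has_edge(order[i], order[j]))
                      for i, j in combinations(range(len(order)), 2))
        cand = (labels, edges)
        best = cand if best is None else min(best, cand)
    return best

def graphlet3_features(G):
    phi = Counter()
    for nodes in combinations(G.nodes, 3):
        sub = G.subgraph(nodes)
        if nx.is_connected(sub):                                  # count connected 3-vertex subgraphs
            phi[canonical_form(G, nodes)] += 1
    return phi

G = nx.cycle_graph(3)
nx.set_node_attributes(G, {0: "C", 1: "C", 2: "O"}, "label")
print(graphlet3_features(G))                                      # one triangle with labels C, C, O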

6.2 Weighted Vertex Kernels for Attributed Graphs

Kernels suitable for attributed graphs typically use user-defined kernels for the comparison of vertex and edge annotations like real-valued vectors. The graph kernel is then obtained by combining these kernels according to closure properties. Recently proposed kernels for attributed graphs like GraphHopper (Feragen et al., 2013) and GraphInvariant (Orsini et al., 2015) use separate kernels for the graph structure and the annotations. They can be expressed as

  $K_{WV}(G, H) = \sum_{v \in V(G)} \ \sum_{v' \in V(H)} k_W(v, v') \cdot k_A(v, v')$,    (7)

where $k_A$ is a user-specified kernel comparing vertex attributes and $k_W$ is a kernel that determines a weight for a vertex pair based on the individual graph structures. Hence, in the following we refer to Equation 7 as weighted vertex kernel. Kernels belonging to this family are easily identifiable as instances of $R$-convolution kernels, cf. Definition 5.4.

For graphs with multi-dimensional real-valued vertex annotations in $\mathbb{R}^d$ one could set $k_A$ to the Gaussian RBF kernel or the dimension-wise product of the hat kernel, respectively, i.e.,

  $k_{\mathrm{RBF}}(x, y) = \exp\!\left(-\frac{\|x - y\|_2^2}{2 \sigma^2}\right)$ and $k_\Lambda(x, y) = \prod_{i=1}^{d} \max\!\left(0,\, 1 - \frac{|x_i - y_i|}{\theta}\right)$.    (8)

Here, $\sigma$ and $\theta$ are parameters controlling the decrease of the kernel value with increasing discrepancy between the two input data points. The selection of the kernel $k_W$ is essential to take the graph structure into account and allows one to obtain different instances of weighted vertex kernels.

6.2.1 Weighted Vertex Kernel Instances

One approach to obtain weights for pairs of vertices is to compare their neighborhoods by the classical Weisfeiler-Lehman label refinement (Shervashidze et al., 2011; Orsini et al., 2015). For a parameter $h$ and a graph $G$ with uniform initial labels $\tau_0$, a sequence $(\tau_0, \dots, \tau_h)$ of refined labels referred to as colors is computed, where $\tau_{i+1}$ is obtained from $\tau_i$ by the following procedure. Sort the multiset of colors $\{\!\!\{\tau_i(u) : u \in N(v)\}\!\!\}$ for every vertex $v$ to obtain a unique sequence of colors and add $\tau_i(v)$ as first element. Assign a new color $\tau_{i+1}(v)$ to every vertex $v$ by employing an injective mapping from sequences to new colors. A reasonable implementation of $k_W$ motivated along the lines of GraphInvariant (Orsini et al., 2015) is

  $k_W(v, v') = \sum_{i=0}^{h} k_\delta(\tau_i(v), \tau_i(v'))$,    (9)

where $\tau_i(v)$ denotes the discrete label of the vertex $v$ after the $i$-th iteration of Weisfeiler-Lehman label refinement of the underlying unlabeled graph. Intuitively, this kernel reflects to what extent the two vertices have a structurally similar neighborhood.

Another graph kernel which fits into the framework of weighted vertex kernels is the GraphHopper kernel (Feragen et al., 2013) with

  $k_W(v, v') = \langle M(v), M(v') \rangle_F$.    (10)

Here $M(v)$ and $M(v')$ are $\delta \times \delta$ matrices, where the entry $[M(v)]_{ij}$ for $v$ in $V(G)$ counts the number of times the vertex $v$ appears as the $i$-th vertex on a shortest path of discrete length $j$ in $G$, $\delta$ denotes the maximum diameter over all graphs, and $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product.

6.2.2 Computing Explicit Feature Maps

In the following we derive an explicit mapping for weighted vertex kernels. Notice that Equation 7 is an instance of Definition 5.4. Hence, by Proposition 5.2 and Equation 4, we obtain an explicit mapping of weighted vertex kernels.

Let $K_{WV}$ be a weighted vertex kernel according to Equation 7 with feature maps $\phi_W$ and $\phi_A$ for $k_W$ and $k_A$, respectively. Then

  $\phi(G) = \sum_{v \in V(G)} \phi_W(v) \otimes \phi_A(v)$    (11)

is a feature map for $K_{WV}$.

Widely used kernels for the comparison of attributes, such as the Gaussian RBF kernel, do not have feature maps of finite dimension. However, Rahimi and Recht (2008) obtained finite-dimensional feature maps approximating the kernels $k_{\mathrm{RBF}}$ and $k_\Lambda$ of Equation (8). Similar results are known for other popular kernels for vectorial data like the Jaccard (Vedaldi and Zisserman, 2012) and the Laplacian kernel (Andoni, 2009).

In the following we approximate $k_A$ in Equation 7 by $\langle \tilde{\phi}_A(x), \tilde{\phi}_A(y) \rangle$, where $\tilde{\phi}_A$ is a finite-dimensional, approximative mapping such that with probability $1 - \rho$

  $|k_A(x, y) - \langle \tilde{\phi}_A(x), \tilde{\phi}_A(y) \rangle| \leq \epsilon$    (12)

for any $\epsilon > 0$, and derive a finite-dimensional, approximative feature map for weighted vertex kernels. We get the following result. Let $K_{WV}$ be a weighted vertex kernel. Let $\phi_W$ be a feature map for $k_W$ and let $\tilde{\phi}_A$ be an approximative mapping for $k_A$ according to Equation 12. Then we can compute an approximative feature map $\tilde{\phi}$ for $K_{WV}$ such that with any constant probability

  $|K_{WV}(G, H) - \langle \tilde{\phi}(G), \tilde{\phi}(H) \rangle| \leq \epsilon$    (13)

for any $\epsilon > 0$. By setting the failure probability for each pair of vertices to $\rho / m^2$, where $m$ is the total number of vertices in the data set, and using the union bound, we get that Equation (12) holds for every pair of vertices in the data set with probability at least $1 - \rho$.

The result then follows from Proposition 6.2.2, and by setting the accuracy in Equation (12) to $\epsilon / (c \cdot \nu^2)$, where $c$ is the maximum value attained by the kernel $k_W$ and $\nu$ is the maximum number of vertices over the whole data set.
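The construction can be sketched as follows (an illustration under our own assumptions, not the authors' code): the attribute kernel is taken to be the Gaussian RBF kernel and approximated by random Fourier features (Rahimi and Recht, 2008), and Equation (11) is applied with the Kronecker product of the weight features and the approximate attribute features.

import numpy as np

def rff(dim_in, dim_out, sigma, rng):
    # random Fourier features approximating the Gaussian RBF kernel with bandwidth sigma
    W = rng.normal(scale=1.0 / sigma, size=(dim_out, dim_in))
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim_out)
    return lambda x: np.sqrt(2.0 / dim_out) * np.cos(W @ x + b)

def weighted_vertex_phi(attrs, weight_feats, phi_a):
    # attrs: attribute vectors a(v); weight_feats: feature vectors phi_W(v) of the weight kernel
    return sum(np.kron(w, phi_a(a)) for a, w in zip(attrs, weight_feats))

rng = np.random.default_rng(0)
phi_a = rff(dim_in=2, dim_out=200, sigma=1.0, rng=rng)
# toy graph with two vertices; the 2-dimensional weight features are purely illustrative
attrs = [np.array([0.1, 0.3]), np.array([1.0, -0.5])]
weights = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
phi_G = weighted_vertex_phi(attrs, weights, phi_a)                # approximate feature vector of dimension 2 * 200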

6.3 Explicit and Implicit Computation of Fixed Length Walk Kernels

The classical walk-based graph kernels (Gärtner et al., 2003; Kashima et al., 2003), in theory, take all walks without a limitation of length into account. However, in several applications it has been reported that only walks up to a certain length have been considered, e.g., for the prediction of protein functions (Borgwardt et al., 2005) or image classification (Harchaoui and Bach, 2007). This might suggest that it is not necessary or even not beneficial to consider the infinite number of possible walks to obtain a satisfying prediction accuracy. Recently, the phenomenon of halting in random walk kernels has been studied (Sugiyama and Borgwardt, 2015), which refers to the fact that walk-based graph kernels like the geometric random walk kernel (Gärtner et al., 2003) might down-weight longer walks so much that their value is dominated by walks of length 1. As a consequence, fixed length walk kernels, which consider only walks of (at most) a specified length, become promising, in particular for graphs with high degree.

We propose explicit and implicit computation schemes for fixed length walk kernels. Our product graph based implicit computation scheme fully supports arbitrary vertex and edge kernels and exploits their sparsity. Previously, no algorithms based on explicit mapping have been proposed for the computation of walk-based kernels. We identify the label diversity and the walk length as key parameters affecting the running time. This is confirmed experimentally in Section 7.

6.3.1 Basic Definitions

A fixed length walk kernel measures the similarity between graphs based on the similarity between all pairs of walks of length $\ell$ contained in the two graphs. A walk of length $\ell$ in a graph $G$ is a sequence $(v_0, e_1, v_1, \dots, e_\ell, v_\ell)$ of vertices and edges such that $e_i = v_{i-1} v_i \in E(G)$ for $1 \leq i \leq \ell$. We denote the set of walks of length $\ell$ in a graph $G$ by $\mathcal{W}_\ell(G)$.

[$\ell$-walk kernel] The $\ell$-walk kernel between two attributed graphs $G$ and $H$ in $\mathcal{G}$ is defined as

  $K_\ell(G, H) = \sum_{w \in \mathcal{W}_\ell(G)} \ \sum_{w' \in \mathcal{W}_\ell(H)} k_{\mathrm{walk}}(w, w')$,    (14)

where $k_{\mathrm{walk}}$ is a kernel between walks.

Definition 6.3.1 is very general and does not specify how to compare walks. An obvious choice is to decompose walks and define $k_{\mathrm{walk}}$ in terms of vertex and edge kernel functions, denoted by $k_V$ and $k_E$, respectively. We consider

  $k_{\mathrm{walk}}(w, w') = \prod_{i=0}^{\ell} k_V(v_i, v'_i) \cdot \prod_{i=1}^{\ell} k_E(e_i, e'_i)$,    (15)

where $w = (v_0, e_1, \dots, e_\ell, v_\ell)$ and $w' = (v'_0, e'_1, \dots, e'_\ell, v'_\ell)$ are two walks. (The same idea to compare walks was proposed by Kashima et al. (2003) as part of the marginalized kernel between labeled graphs.) Assume the graphs in a data set have simple vertex and edge labels from a finite alphabet $\Sigma$. An appropriate choice then is to use the Dirac kernel for both vertex and edge kernels between the associated labels. In this case two walks are considered equal if and only if the labels of all corresponding vertices and edges are equal. We refer to this kernel by

  $k_{\mathrm{walk}}^{\delta}(w, w') = \prod_{i=0}^{\ell} k_\delta(\mu(v_i), \mu(v'_i)) \cdot \prod_{i=1}^{\ell} k_\delta(\mu(e_i), \mu(e'_i))$,    (16)

where $k_\delta$ is the Dirac kernel. For graphs with continuous or multi-dimensional annotations this choice is not appropriate and $k_V$ and $k_E$ should be selected depending on the application-specific vertex and edge attributes.

A variant of the $\ell$-walk kernel can be obtained by considering all walks up to length $\ell$. [Max-$\ell$-walk kernel] The Max-$\ell$-walk kernel between two attributed graphs $G$ and $H$ in $\mathcal{G}$ is defined as

  $K_{\leq \ell}(G, H) = \sum_{i=0}^{\ell} \lambda_i \cdot K_i(G, H)$,    (17)

where $\lambda_0, \dots, \lambda_\ell \geq 0$ are weights. In the following we primarily focus on the $\ell$-walk kernel, although our algorithms and results can easily be transferred to the Max-$\ell$-walk kernel.

6.3.2 Walk and R-convolution Kernels

We show that the $\ell$-walk kernel is p.s.d. if $k_{\mathrm{walk}}$ is a valid kernel by seeing it as an instance of an $R$-convolution kernel. We use this fact to develop an algorithm for explicit mapping based on the ideas presented in Section 5.4.

The $\ell$-walk kernel is positive semidefinite if $k_{\mathrm{walk}}$ is defined according to Equation (15) and $k_V$ and $k_E$ are valid kernels. Equation (14) with $k_{\mathrm{walk}}$ defined according to Equation (15) is the $R$-convolution kernel, cf. Definition 5.4, directly obtained when graphs are decomposed into walks of length $\ell$ and the base kernels are chosen as $k_V$ for the vertex components and $k_E$ for the edge components of a walk. Then the product of the base kernels equals $k_{\mathrm{walk}}$, implying that the $\ell$-walk kernel is a valid kernel if $k_V$ and $k_E$ are valid kernels.

Since kernels are closed under taking linear combinations with non-negative coefficients, see Proposition 5.2, we obtain the following corollary. The Max-$\ell$-walk kernel is positive semidefinite.

Since $\ell$-walk kernels are $R$-convolution kernels, we can derive a feature map. Our theoretical results show that the dimension of the feature space and the density of the feature vectors for $K_\ell$ depend multiplicatively on the same properties of the feature vectors for $k_V$ and $k_E$. Hence we consider a special case of high practical relevance: We assume graphs to have simple labels from the alphabet $\Sigma$ and consider the kernel given by Equation (16). A walk $w$ of length $\ell$ is then associated with a label sequence $s(w) = (\mu(v_0), \mu(e_1), \mu(v_1), \dots, \mu(e_\ell), \mu(v_\ell))$. In this case graphs are decomposed into walks and two walks $w$ and $w'$ are considered equivalent if and only if $s(w) = s(w')$; each label sequence can be considered an identifier of an equivalence class of walks. This gives rise to the feature map $\phi$, where each component is associated with a label sequence $s$ and counts the number of walks $w$ with $s(w) = s$. Note that the obtained feature vectors have $|\Sigma|^{2\ell+1}$ components, but are typically sparse.

We can easily derive a feature map of the Max-$\ell$-walk kernel from the feature maps of all $i$-walk kernels with $i \leq \ell$, cf. Proposition 5.2.
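The feature map can be computed by enumerating walks and counting their label sequences, as in the following sketch (assuming NetworkX graphs with 'label' attributes on vertices and edges; not the authors' implementation):

from collections import Counter
import networkx as nx

def walk_features(G, length):
    phi = Counter()
    def extend(v, seq, remaining):
        if remaining == 0:
            phi[tuple(seq)] += 1                                  # one walk realizing this label sequence
            return
        for u in G.neighbors(v):
            extend(u, seq + [G.edges[v, u]["label"], G.nodes[u]["label"]], remaining - 1)
    for v in G.nodes:
        extend(v, [G.nodes[v]["label"]], length)
    return phi

def walk_kernel(G, H, length):
    x, y = walk_features(G, length), walk_features(H, length)
    return sum(c * y.get(s, 0) for s, c in x.items())

Note that the number of enumerated walks grows quickly with the walk length and the vertex degrees, which is the source of the computational phase transition studied in Section 7.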

6.3.3 Implicit Kernel Computation

An essential part of the implicit computation scheme is the generation of the product graph, which is then used to compute the $\ell$-walk kernel.

Computing Direct Product Graphs.

In order to support graphs with arbitrary attributes, the vertex and edge kernels $k_V$ and $k_E$ are considered part of the input. Product graphs can be used to represent these kernel values between pairs of vertices and edges of the input graphs in a compact manner. We avoid creating vertices and edges that would represent incompatible pairs with kernel value zero. The following definition can be considered a weighted version of the direct product graph introduced by Gärtner et al. (2003) for kernel computation. (Note that we consider undirected graphs, while Gärtner et al. (2003) refer to directed graphs.)

[Weighted Direct Product Graph] For two attributed graphs $G$ and $H$, and given vertex and edge kernels $k_V$ and $k_E$, the weighted direct product graph (WDPG) is denoted by $G \times H$ and defined by

  $V(G \times H) = \{(v, v') \in V(G) \times V(H) : k_V(v, v') > 0\}$,
  $E(G \times H) = \{\{(u, u'), (v, v')\} : uv \in E(G) \wedge u'v' \in E(H) \wedge k_E(uv, u'v') > 0\}$,

where each vertex $(v, v')$ is assigned the weight $k_V(v, v')$ and each edge $\{(u, u'), (v, v')\}$ the weight $k_E(uv, u'v')$.
[Figure 1: Two attributed graphs (a) and (b) and their weighted direct product graph (c). The vertex kernel is assumed to be the Dirac kernel; the edge kernel assigns full weight if the edge labels are equal and a reduced weight if one edge label is "=" and the other is "-". Thin edges in (c) represent edges with reduced weight, while all other edges and vertices have full weight.]

An example with two graphs and their weighted direct product graph obtained for specific vertex and edge kernels is shown in Figure 1. Algorithm 3 computes a weighted direct product graph and does not consider edges between pairs of vertices that have been identified as incompatible, i.e., pairs with $k_V(v, v') = 0$.

Input : Graphs $G$ and $H$, vertex and edge kernels $k_V$ and $k_E$.
Output : Graph $G \times H$ with vertex and edge weights.
Procedure Wdpg($G$, $H$, $k_V$, $k_E$)
      forall $v \in V(G)$, $v' \in V(H)$ do
            if $k_V(v, v') > 0$ then
                  create vertex $(v, v')$ with weight $k_V(v, v')$
      forall $(u, v)$ with $uv \in E(G)$ do
            forall $(u', v')$ with $u'v' \in E(H)$, $(u, u'), (v, v') \in V(G \times H)$ and $(u, u') < (v, v')$ do
                  if $k_E(uv, u'v') > 0$ then
                        create edge $\{(u, u'), (v, v')\}$ with weight $k_E(uv, u'v')$
Algorithm 3 Weighted Direct Product Graph

Since the weighted direct product graph is undirected, we must avoid that the same pair of edges is processed twice. Therefore, we suppose that there is an arbitrary total order $<$ on the vertices of $G \times H$, such that for every pair of distinct vertices $a$ and $b$ either $a < b$ or $b < a$ holds. In the inner loop of Algorithm 3, where both orientations of each edge are considered, we restrict the edge pairs that are compared to one of these cases.

Let $n = |V(G)|$, $n' = |V(H)|$ and $m = |E(G)|$, $m' = |E(H)|$. Algorithm 3 computes the weighted direct product graph in time $O(n n' \cdot T_V + m m' \cdot T_E)$, where $T_V$ and $T_E$ are the running times to compute vertex and edge kernels, respectively.

Note that in case of a sparse vertex kernel, which yields zero for most of the vertex pairs of the input graphs, $|V(G \times H)| \ll n n'$ holds. Algorithm 3 compares two edges by $k_E$ only in case of matching endpoints contained in $V(G \times H)$; therefore in practice the running time to compare edges in the inner loop might be considerably less than suggested by Proposition 6.3.3. We show this empirically in Section 7.3. In case of sparse graphs, i.e., $m \in O(n)$ and $m' \in O(n')$, and vertex and edge kernels which can be computed in time $O(1)$, the running time of Algorithm 3 is $O(N^2)$, where $N = \max\{n, n'\}$.

Counting Weighted Walks.

Given an undirected graph $G$ with adjacency matrix $A$, let $[A^\ell]_{ij}$ denote the element at position $(i, j)$ of the matrix $A^\ell$. It is well-known that $[A^\ell]_{ij}$ is the number of walks from vertex $i$ to vertex $j$ of length $\ell$. The number of $\ell$-walks of $G$ consequently is $\sum_{i,j} [A^\ell]_{ij} = \mathbf{1}^\top A^\ell \mathbf{1}$, where $\mathbf{1}$ denotes the all-ones vector. The $i$-th element of the recursively defined vector $r_\ell = A r_{\ell-1}$ with $r_0 = \mathbf{1}$ is the number of walks of length $\ell$ starting at vertex $i$. Hence, we can compute the number of $\ell$-walks by computing either matrix powers or matrix-vector products. Note that even for sparse (connected) graphs $A^\ell$ quickly becomes dense with increasing walk length $\ell$. The $\ell$-th power of an $n \times n$ matrix can be computed naïvely in time $O(\ell \cdot n^3)$ and in $O(\log \ell \cdot n^\omega)$ using exponentiation by squaring, where $\omega$ is the exponent of matrix multiplication. The vector $r_\ell$ can be computed by means of $\ell$ matrix-vector multiplications, where the matrix remains unchanged over all iterations. Since direct product graphs tend to be sparse in practice, we propose a method to compute the $\ell$-walk kernel that is inspired by matrix-vector multiplication.
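A sketch of the matrix-vector approach for plain walk counting (using SciPy sparse matrices as an assumption for illustration):

import numpy as np
from scipy.sparse import csr_matrix

def count_walks(adjacency, length):
    A = csr_matrix(adjacency)
    r = np.ones(A.shape[0])                                       # r_0 = all-ones vector
    for _ in range(length):
        r = A @ r                                                 # r_l = A r_{l-1}
    return r.sum()                                                # total number of walks of the given length

A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]                             # path graph on three vertices
print(count_walks(A, 2))                                          # 6.0 (six walks of length 2)

The weighted variant discussed next replaces these plain counts by sums of walk weights on the weighted direct product graph.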

In order to compute the $\ell$-walk kernel we do not want to count the walks, but sum up the weights of the walks, which in turn are the products of vertex and edge weights. Let $k_{\mathrm{walk}}$ be defined according to Equation (15); then we can formulate the $\ell$-walk kernel as