KernelGCN
Codes for NIPS 2019 Paper: Rethinking Kernel Methods for Node Representation Learning on Graphs
Graph kernels are kernel methods measuring graph similarity and serve as a standard tool for graph classification. However, the use of kernel methods for node classification, which is a related problem to graph representation learning, is still ill-posed and the state-of-the-art methods are heavily based on heuristics. Here, we present a novel theoretical kernel-based framework for node classification that can bridge the gap between these two representation learning problems on graphs. Our approach is motivated by graph kernel methodology but extended to learn the node representations capturing the structural information in a graph. We theoretically show that our formulation is as powerful as any positive semidefinite kernels. To efficiently learn the kernel, we propose a novel mechanism for node feature aggregation and a data-driven similarity metric employed during the training phase. More importantly, our framework is flexible and complementary to other graph-based deep learning models, e.g., Graph Convolutional Networks (GCNs). We empirically evaluate our approach on a number of standard node classification benchmarks, and demonstrate that our model sets the new state of the art.
Graph structured data, such as citation networks Giles et al. (1998); McCallum et al. (2000); Sen et al. (2008), biological models Gilmer et al. (2017); You et al. (2018), grid-like data Tang et al. (2018); Tian et al. (2018); Zhu et al. (2018) and skeleton-based motion systems Chen et al. (2019); Yan et al. (2018); Zhao et al. (2018, 2019), are abundant in the real world. Therefore, learning to understand graphs is a crucial problem in machine learning. Previous studies in the literature generally fall into two main categories: (1) graph classification Draief et al. (2018); Kipf and Welling (2017); Xu et al. (2019); Zhang et al. (2018b, c), where the whole structure of graphs is captured for similarity comparison; (2) node classification Abu-El-Haija et al. (2018); Kipf and Welling (2017); Veličković et al. (2018); Xu et al. (2018); Zhang et al. (2018a), where the structural identity of nodes is determined for representation learning.
For graph classification, kernel methods, i.e., graph kernels, have become a standard tool Kriege et al. (2017). Given a large collection of graphs, possibly with node and edge attributes, such algorithms aim to learn a kernel function that best captures the similarity between any two graphs. The graph kernel function can be utilized to classify graphs via standard kernel methods such as support vector machines or $k$-nearest neighbors. Moreover, recent studies Xu et al. (2019); Zhang et al. (2018b) also demonstrate a close connection between Graph Neural Networks (GNNs) and the Weisfeiler-Lehman graph kernel Shervashidze et al. (2011), and relate GNNs to the classic graph kernel methods for graph classification.
Node classification, on the other hand, is still an ill-posed problem in representation learning on graphs. Although identification of node classes often leverages their features, a more challenging and important scenario is to incorporate the graph structure for classification. Recent efforts in Graph Convolutional Networks (GCNs) Kipf and Welling (2017) have made great progress on node classification. In particular, these efforts broadly follow a recursive neighborhood aggregation scheme to capture structural information, where each node aggregates feature vectors of its neighbors to compute its new features Abu-El-Haija et al. (2018); Xu et al. (2018); Zhang et al. (2018a). Empirically, these GCNs have achieved state-of-the-art performance on node classification. However, the design of new GCNs is mostly based on empirical intuition, heuristics, and experimental trial-and-error.
In this paper, we propose a novel theoretical framework leveraging kernel methods for node classification. Motivated by graph kernels, our key idea is to decouple the kernel function so that it can be learned driven by the node class labels on the graph. Meanwhile, its validity and expressive power are guaranteed. To be specific, this paper makes the following contributions:
We propose a learnable kernel-based framework for node classification. The kernel function is decoupled into a feature mapping function and a base kernel to ensure that it is valid as well as learnable. Then we present a data-driven similarity metric and its corresponding learning criteria for efficient kernel training. The implementation of each component is extensively discussed. An overview of our framework is shown in Fig. 1.
We demonstrate the validity of our learnable kernel function. More importantly, we theoretically show that our formulation is powerful enough to express any valid positive semidefinite kernels.
A novel feature aggregation mechanism for learning node representations is derived from the perspective of kernel smoothing. Compared with GCNs, our model captures the structural information of a node by aggregation in a single step rather than recursively, and is thus more efficient.
We discuss the close connection between the proposed approach and GCNs. We also show that our method is flexible and complementary to GCNs and their variants but more powerful, and can be leveraged as a general framework for future work.
Graph Kernels. Graph kernels are kernels defined on graphs to capture the graph similarity, which can be used in kernel methods for graph classification. Many graph kernels are instances of the family of convolutional kernels Haussler (1999). Some of them measure the similarity between walks or paths on graphs Borgwardt and Kriegel (2005); Vishwanathan et al. (2010). Other popular kernels are designed based on limited-sized substructures Horváth et al. (2004); Shervashidze et al. (2009); Shervashidze and Borgwardt (2009); Shervashidze et al. (2011). Most graph kernels are employed in models which have learnable components, but the kernels themselves are hand-crafted and motivated by graph theory. Some learnable graph kernels have been proposed recently, such as Deep Graph Kernels Yanardag and Vishwanathan (2015) and Graph Matching Networks Li et al. (2019). Compared to these approaches, our method targets learning kernels for node representation learning.
Node Representation Learning. Conventional methods for learning node representations largely focus on matrix factorization. They directly adopt classic techniques for dimension reduction Ahmed et al. (2013); Belkin and Niyogi (2002). Other methods are derived from the random walk algorithm Mikolov et al. (2013); Perozzi et al. (2014) or sub-graph structures Grover and Leskovec (2016); Tang et al. (2015); Yang et al. (2016); Ribeiro et al. (2017). Recently, Graph Convolutional Networks (GCNs) have emerged as an effective class of models for learning representations of graph structured data. Introduced in Kipf and Welling (2017), they consist of an iterative process that aggregates and transforms the representation vectors of each node's neighbors to capture structural information. Several variants have since been proposed, which employ self-attention mechanisms Veličković et al. (2018) or improve network architectures Xu et al. (2018); Zhang et al. (2018a) to boost the performance. However, most of them are based on empirical intuition and heuristics.
We begin by summarizing some of the most important concepts about kernel methods as well as representation learning on graphs and, along the way, introduce our notations.
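Kernel Concepts. A kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a function of two arguments: $K(x, y)$ for $x, y \in \mathcal{X}$. The kernel function $K$ is symmetric, i.e., $K(x, y) = K(y, x)$, which means it can be interpreted as a measure of similarity. If the Gram matrix $\mathbf{K} \in \mathbb{R}^{N \times N}$ defined by $\mathbf{K}(i, j) = K(x_i, x_j)$ for any $\{x_i\}_{i=1}^{N}$ is positive semidefinite (p.s.d.), then $K$ is a p.s.d. kernel Murphy (2012). If $K(x, y)$ can be represented as $\langle \Psi(x), \Psi(y) \rangle$, where $\Psi: \mathcal{X} \to \mathbb{R}^{D}$ is a feature mapping function, then $K$ is a valid kernel.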
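As a quick numerical illustration of these definitions, the following minimal sketch builds a Gram matrix and spot-checks symmetry and positive semidefiniteness through the quadratic form $z^\top \mathbf{K} z \ge 0$. The RBF kernel, its bandwidth `gamma`, and the random points are illustrative assumptions, not choices taken from the paper.

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2); gamma is an illustrative value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(kernel, points):
    """Gram matrix with entries K(i, j) = kernel(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(matrix, z):
    """Compute z^T K z for a square matrix K given as nested lists."""
    n = len(z)
    return sum(z[i] * matrix[i][j] * z[j] for i in range(n) for j in range(n))


random.seed(0)
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(6)]
gram = gram_matrix(rbf_kernel, points)

is_symmetric = all(
    abs(gram[i][j] - gram[j][i]) < 1e-12 for i in range(6) for j in range(6)
)
# A p.s.d. matrix satisfies z^T K z >= 0 for every z; random z only spot-check it.
min_form = min(
    quadratic_form(gram, [random.gauss(0, 1) for _ in range(6)])
    for _ in range(200)
)
```

Random probing cannot prove positive semidefiniteness; it only fails to find a counterexample, which is what one expects for a valid kernel.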
Graph Kernels. In the graph space $\mathcal{G}$, we denote a graph as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the edge set of $G$. Given two graphs $G_i = (V_i, E_i)$ and $G_j = (V_j, E_j)$ in $\mathcal{G}$, the graph kernel $K_G(G_i, G_j)$ measures the similarity between them. According to the definition in Scholkopf and Smola (2001), the kernel $K_G$ must be p.s.d. and symmetric. The graph kernel $K_G$ between $G_i$ and $G_j$ is defined as:
$$K_G(G_i, G_j) = \sum_{v_i \in V_i} \sum_{v_j \in V_j} k_{\text{base}}\big(f(v_i), f(v_j)\big), \tag{1}$$
where $k_{\text{base}}$ is the base kernel for any pair of nodes in $G_i$ and $G_j$, and $f$ is a function to compute the feature vector associated with each node. However, deriving a new p.s.d. graph kernel is a non-trivial task. Previous methods often implement $k_{\text{base}}$ and $f$ as the dot product between hand-crafted graph heuristics Neuhaus and Bunke (2005); Shervashidze and Borgwardt (2009); Borgwardt and Kriegel (2005). These approaches have few learnable parameters.
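The double sum in Eq. (1) can be sketched in a few lines. This is only an illustration under stated assumptions: the node feature function is taken to be the node degree and the base kernel the dot product, both hypothetical stand-ins for the hand-crafted heuristics mentioned above.

```python
def dot(x, y):
    """Dot-product base kernel."""
    return sum(a * b for a, b in zip(x, y))


def degree_features(num_nodes, edges):
    """Hypothetical node feature function f: each node is described by its degree."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [[float(d)] for d in degree]


def graph_kernel(features_a, features_b, base_kernel=dot):
    """Sum the base kernel over all node pairs drawn from the two graphs."""
    return sum(base_kernel(fa, fb) for fa in features_a for fb in features_b)


triangle = degree_features(3, [(0, 1), (1, 2), (0, 2)])  # degrees 2, 2, 2
path = degree_features(3, [(0, 1), (1, 2)])              # degrees 1, 2, 1

k_triangle_path = graph_kernel(triangle, path)    # 24.0
k_triangle_self = graph_kernel(triangle, triangle)  # 36.0
k_path_self = graph_kernel(path, path)            # 16.0
```

With one-dimensional features and a dot-product base kernel, the graph kernel reduces to the product of the two feature sums, which is why it is symmetric in its arguments.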
Representation Learning on Graphs. Although graph kernels have been applied to a wide range of applications, most of them depend on hand-crafted heuristics. In contrast, representation learning aims to automatically learn to encode graph structures into low-dimensional embeddings. Formally, given a graph $G = (V, E)$, we follow Hamilton et al. (2017) to define representation learning as an encoder-decoder framework, where we minimize the empirical loss $\mathcal{L}$ over a set of training node pairs $\mathcal{D} \subseteq V \times V$:
$$\mathcal{L} = \sum_{(v_i, v_j) \in \mathcal{D}} \ell\big(\text{ENC-DEC}(v_i, v_j),\, s_G(v_i, v_j)\big). \tag{2}$$
Equation (2) has three methodological components: ENC-DEC, $s_G$ and $\ell$. Most of the previous methods on representation learning can be distinguished by how these components are defined. The detailed meaning of each component is explained as follows.
ENC-DEC is an encoder-decoder function. It contains an encoder which projects each node $v \in V$ into an $M$-dimensional vector to generate the node embedding. This function contains a number of trainable parameters to be optimized during the training phase. It also includes a decoder function, which reconstructs pairwise similarity measurements from the node embeddings generated by the encoder.
$s_G$ is a pairwise similarity function defined over the graph $G$. This function is user-specified, and it is used for measuring the similarity between nodes in $G$.
$\ell$ is a loss function, which is leveraged to train the model. This function evaluates the quality of the pairwise reconstruction between the estimated value $\text{ENC-DEC}(v_i, v_j)$ and the true value $s_G(v_i, v_j)$.
Given a graph $G$, as we can see from Eq. (2), the encoder-decoder ENC-DEC aims to approximate the pairwise similarity function $s_G$, which leads to a natural intuition: we can replace ENC-DEC with a kernel function $K_\theta$ parameterized by $\theta$ to measure the similarity between nodes in $G$, i.e.,
$$\mathcal{L} = \sum_{(v_i, v_j) \in \mathcal{D}} \ell\big(K_\theta(v_i, v_j),\, s_G(v_i, v_j)\big). \tag{3}$$
However, there exist two technical challenges: (1) designing a valid p.s.d. kernel which captures the node feature is non-trivial; (2) it is impossible to handcraft a unified kernel to handle all possible graphs with different characteristics Ramon and Gärtner (2003). To tackle these issues, we introduce a novel formulation to replace $K_\theta$. Inspired by the graph kernel as defined in Eq. (1) and the mapping kernel framework Shin and Kuboyama (2008), our key idea is to decouple $K_\theta$ into two components: a base kernel $k_{\text{base}}$ which is p.s.d. to maintain the validity, and a learnable feature mapping function $g_\theta$ to ensure the flexibility of the resulting kernel. Therefore, we rewrite Eq. (3) by $K_\theta(v_i, v_j) = k_{\text{base}}(g_\theta(v_i), g_\theta(v_j))$ for $v_i, v_j \in V$ of the graph $G$ to optimize the following objective:
$$\mathcal{L} = \sum_{(v_i, v_j) \in \mathcal{D}} \ell\big(k_{\text{base}}(g_\theta(v_i), g_\theta(v_j)),\, s_G(v_i, v_j)\big). \tag{4}$$
Theorem 1 demonstrates that the proposed formulation, i.e., $K_\theta(v_i, v_j) = k_{\text{base}}(g_\theta(v_i), g_\theta(v_j))$, is still a valid p.s.d. kernel for any feature mapping function $g_\theta$ parameterized by $\theta$.
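Theorem 1. Let $g_\theta: V \to \mathbb{R}^M$ be a function which maps nodes (or their corresponding features) to an $M$-dimensional Euclidean space. Let $k_{\text{base}}: \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ be any valid p.s.d. kernel. Then, $K_\theta(v_i, v_j) = k_{\text{base}}(g_\theta(v_i), g_\theta(v_j))$ is a valid p.s.d. kernel.
Proof. Let $\Phi$ be the corresponding feature mapping function of the p.s.d. kernel $k_{\text{base}}$. Then, we have $k_{\text{base}}(z_i, z_j) = \langle \Phi(z_i), \Phi(z_j) \rangle$, where $z_i, z_j \in \mathbb{R}^M$. Substitute $g_\theta(v_i), g_\theta(v_j)$ for $z_i, z_j$, and we have $k_{\text{base}}(g_\theta(v_i), g_\theta(v_j)) = \langle \Phi(g_\theta(v_i)), \Phi(g_\theta(v_j)) \rangle$. Write the new feature mapping as $\Psi(v) = \Phi(g_\theta(v))$, and we immediately have that $k_{\text{base}}(g_\theta(v_i), g_\theta(v_j)) = \langle \Psi(v_i), \Psi(v_j) \rangle$. Hence, $k_{\text{base}}(g_\theta(v_i), g_\theta(v_j))$ is a valid p.s.d. kernel. ∎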
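The argument of the proof can be checked numerically in the simplest case, where the base kernel is the dot product. The sketch below is an illustration, not the authors' code: the feature map (one linear layer followed by tanh) and all random values are hypothetical. It shows that the quadratic form of the composed kernel equals a squared norm and therefore cannot be negative.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def feature_map(x, weights):
    """Hypothetical learnable feature map: one linear layer followed by tanh."""
    return [math.tanh(dot(row, x)) for row in weights]


random.seed(1)
weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]  # R^4 -> R^2
node_features = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
embeddings = [feature_map(x, weights) for x in node_features]

# Gram matrix of the composed kernel with a dot-product base kernel.
gram = [[dot(a, b) for b in embeddings] for a in embeddings]

z = [random.gauss(0, 1) for _ in range(5)]
quad = sum(z[i] * gram[i][j] * z[j] for i in range(5) for j in range(5))

# The same quantity written as the squared norm of sum_i z_i * embedding_i.
combo = [sum(z[i] * embeddings[i][d] for i in range(5)) for d in range(2)]
squared_norm = dot(combo, combo)
```

The two quantities agree up to floating-point error for any choice of `weights`, which mirrors the claim that validity does not depend on the parameters of the feature map.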
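A natural follow-up question is whether our proposed formulation is, in principle, powerful enough to express any valid p.s.d. kernel. Our answer, in Theorem 2, is yes: if the base kernel has an invertible feature mapping function, then the resulting kernel is able to model any valid p.s.d. kernel.
Theorem 2. Let $K(v_i, v_j)$ be any valid p.s.d. kernel for node pairs $(v_i, v_j) \in V \times V$. Let $k_{\text{base}}(z_i, z_j)$ be a p.s.d. kernel which has an invertible feature mapping function $\Phi$. Then there exists a feature mapping function $g_\theta$, such that $K(v_i, v_j) = k_{\text{base}}(g_\theta(v_i), g_\theta(v_j))$.
Proof. Let $\Psi$ be the corresponding feature mapping function of the p.s.d. kernel $K$, and then we have $K(v_i, v_j) = \langle \Psi(v_i), \Psi(v_j) \rangle$. Similarly, for $z_i, z_j \in \mathbb{R}^M$, we have $k_{\text{base}}(z_i, z_j) = \langle \Phi(z_i), \Phi(z_j) \rangle$. Substitute $g_\theta(v_i), g_\theta(v_j)$ for $z_i, z_j$, and then it is easy to see that $g_\theta = \Phi^{-1} \circ \Psi$ is the desired feature mapping function when $\Phi^{-1}$ exists. ∎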
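Theorems 1 and 2 have demonstrated the validity and power of the proposed formulation in Eq. (4). In this section, we discuss how to implement and learn the feature mapping function $g_\theta$, the base kernel $k_{\text{base}}$, and the similarity metric $s_G$ with its loss $\ell$, respectively.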
Implementation of the Feature Mapping Function $g_\theta$. The function $g_\theta$ aims to project the feature vector $x_v$ of each node $v$ into a better space for similarity measurement. Our key idea is that in a graph, connected nodes usually share some similar characteristics, and thus changes between nearby nodes in the latent space of nodes should be smooth. Inspired by the concept of kernel smoothing, we consider $g_\theta$ as a feature smoother which maps $x_v$ into a smoothed latent space according to the graph structure. The kernel smoother estimates a function as the weighted average of neighboring observed data. To be specific, given a node $v \in V$, according to the Nadaraya-Watson kernel-weighted average Friedman et al. (2001), a feature smoothing function is defined as:
$$g(v) = \frac{\sum_{u \in V} k(u, v)\, p(u)}{\sum_{u \in V} k(u, v)}, \tag{5}$$
where $p$ is a mapping function to compute the feature vector of each node, and here we let $p(v) = x_v$; $k$ is a pre-defined kernel function to capture pairwise relations between nodes. Note that we omit $\theta$ for $g$ here since there are no learnable parameters in Eq. (5). In the context of graphs, the natural choice of computing $k$ is to follow the graph structure, i.e., the structural information within the node’s $h$-hop neighborhood.
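The Nadaraya-Watson average in Eq. (5) can be sketched as follows. The pairwise weights used here (1 for a node itself and for its direct neighbors on a three-node path graph, 0 otherwise) and the scalar features are illustrative assumptions.

```python
def kernel_smooth(node, features, weights):
    """Nadaraya-Watson average: sum_u k(u, v) p(u) / sum_u k(u, v)."""
    row = weights[node]
    total = sum(row)
    dim = len(features[0])
    return [
        sum(row[u] * features[u][d] for u in range(len(features))) / total
        for d in range(dim)
    ]


# Path graph 0 - 1 - 2; k(u, v) = 1 for neighbours and for u == v (an assumption).
weights = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
features = [[0.0], [3.0], [6.0]]  # p(v) = x_v, one scalar feature per node

smoothed = [kernel_smooth(v, features, weights) for v in range(3)]
# smoothed == [[1.5], [3.0], [4.5]]
```

Each node moves toward the average of its neighborhood, so the end nodes are pulled toward the middle value while the middle node stays where it is.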
To compute $g$, we let $A$ be the adjacency matrix of the given graph $G$ and $I$ be the identity matrix of the same size. We notice that $I + D^{-1/2} A D^{-1/2}$ is a valid p.s.d. matrix, where $D(i, i) = \sum_j A(i, j)$. Thus we can employ this matrix to define the kernel function $k$. However, in practice, this matrix would lead to numerical instabilities and exploding or vanishing gradients when used for training deep neural networks. To alleviate this problem, we adopt the renormalization trick Kipf and Welling (2017): $I + D^{-1/2} A D^{-1/2} \to \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ and $\tilde{D}(i, i) = \sum_j \tilde{A}(i, j)$. Then the $h$-hop neighborhood can be computed directly from the $h$-th power of $\hat{A}$, i.e., $\hat{A}^h$. And the kernel for node pairs $v_i, v_j$ is computed as $k(v_i, v_j) = \hat{A}^h(i, j)$. After collecting the feature vector $x_v$ of each node into a matrix $X_V$, we rewrite Eq. (5) approximately into its matrix form:
$$g(V) \approx \hat{A}^h X_V. \tag{6}$$
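The renormalization trick and the single-step $h$-hop aggregation can be sketched with plain nested lists. This is a minimal illustration on a three-node path graph; the helper names and the toy features are assumptions, and a practical implementation would use sparse matrices.

```python
import math


def renormalized_adjacency(adj):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2}, the renormalization trick."""
    n = len(adj)
    a_tilde = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    degree = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matrix_power(m, h):
    """Multiply the identity by m exactly h times."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(h):
        result = matmul(result, m)
    return result


adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # path graph 0 - 1 - 2
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

a_hat = renormalized_adjacency(adj)
a_hat_2 = matrix_power(a_hat, 2)      # weights over the 2-hop neighbourhood
smoothed = matmul(a_hat_2, features)  # single-step aggregation of the features
```

Nodes 0 and 2 are not adjacent, so `a_hat[0][2]` is zero, while the squared matrix connects them through node 1.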
Next, we enhance the expressive power of Eq. (6) to model any valid p.s.d. kernel by implementing it with deep neural networks based on the following two aspects. First, we make use of multi-layer perceptrons (MLPs) to model and learn the composite function $\Phi^{-1} \circ \Psi$ in Theorem 2, thanks to the universal approximation theorem Hornik (1991); Hornik et al. (1989). Second, we add learnable weights to different hops of node neighbors. As a result, our final feature mapping function $g_\theta$ is defined as:
$$g_\theta(V) = \sum_{h} \omega_h \cdot \big(\hat{A}^h \odot M^{(h)}\big) \cdot \text{MLP}^{(l)}(X_V), \tag{7}$$
where $\theta$ means the set of parameters in $g_\theta$; $\omega_h$ is a learnable parameter for the $h$-hop neighborhood of each node $v$; $\odot$ is the Hadamard (element-wise) product; $M^{(h)}$ is an indicator matrix where $M^{(h)}(i, j)$ equals 1 if $v_i$ is an $h$-th hop neighbor of $v_j$ and 0 otherwise. The hyperparameter $l$ controls the number of layers in the MLP.
Equation (7) can be interpreted as a weighted feature aggregation schema around the given node and its neighbors, which is employed to compute the node representation. It has a close connection with Graph Neural Networks. We leave it in Section 5 for a more detailed discussion.
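The weighted aggregation described for Eq. (7) can be sketched as below. Several details are assumptions made for illustration only and are not confirmed by the text: an $h$-th hop neighbor is taken to be a node at shortest-path distance exactly $h$, hop 0 denotes the node itself, the MLP uses ReLU layers without biases, and all function names are hypothetical.

```python
import math
from collections import deque


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def renormalized_adjacency(adj):
    n = len(adj)
    a_tilde = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    deg = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def hop_distance_matrix(adj):
    """Shortest-path hop counts via BFS; unreachable pairs stay at -1."""
    n = len(adj)
    dist = [[-1] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[source][v] < 0:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


def relu_layer(x, w):
    """One MLP layer: ReLU(x @ w); biases are omitted for brevity."""
    return [[max(0.0, value) for value in row] for row in matmul(x, w)]


def feature_mapping(adj, features, hop_weights, mlp_weights):
    """Sum over hops h of omega_h * (A_hat^h masked to hop h) @ MLP(X)."""
    n = len(adj)
    hidden = features
    for w in mlp_weights:
        hidden = relu_layer(hidden, w)
    a_hat = renormalized_adjacency(adj)
    dist = hop_distance_matrix(adj)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A_hat^0
    out = [[0.0] * len(hidden[0]) for _ in range(n)]
    for h, omega in enumerate(hop_weights):
        # Hadamard product with the indicator matrix of exact-distance-h pairs.
        masked = [
            [power[i][j] if dist[i][j] == h else 0.0 for j in range(n)]
            for i in range(n)
        ]
        term = matmul(masked, hidden)
        out = [[o + omega * t for o, t in zip(ro, rt)] for ro, rt in zip(out, term)]
        power = matmul(power, a_hat)
    return out


adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # path graph 0 - 1 - 2
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for trained MLP weights
embeddings = feature_mapping(adj, features, [0.5, 0.3, 0.2], [identity])
```

The MLP is applied once and the aggregation is a single weighted sum over hops, so no intermediate per-layer node states need to be kept. In training, `hop_weights` and `mlp_weights` would be the learnable parameters.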
Implementation of the Base Kernel $k_{\text{base}}$. As we have shown in Theorem 2, in order to model an arbitrary p.s.d. kernel, we require that the corresponding feature mapping function $\Phi$ of the base kernel $k_{\text{base}}$ must be invertible, i.e., $\Phi^{-1}$ exists. An obvious choice is to let $\Phi$ be the identity function, in which case $k_{\text{base}}$ reduces to the dot product between nodes in the latent space. Since $g_\theta$ maps node representations to a finite dimensional space, the identity function makes our model directly measure the node similarity in this space. On the other hand, an alternative choice of $k_{\text{base}}$ is the RBF kernel, which additionally projects node representations to an infinite dimensional latent space before comparison. We compare both implementations in the experiments for further evaluation.
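Data-Driven Similarity Metric $s_G$ and Criteria $\ell$. In node classification, each node $v_i \in V$ is associated with a class label $y_i \in Y$. We aim to measure node similarity with respect to their class labels rather than hand-designed metrics. Naturally, we define the pairwise similarity $s_G$ as:
$$s_G(v_i, v_j) = \begin{cases} 1 & \text{if } y_i = y_j, \\ -1 & \text{otherwise}. \end{cases} \tag{8}$$
However, in practice, it is hard to directly minimize the loss between $K_\theta$ and $s_G$ in Eq. (8). Instead, we consider a “soft” version of $s_G$, where we require that the similarity of node pairs with the same label is greater than that of pairs with distinct labels by a margin. Therefore, we train the kernel $K_\theta(v_i, v_j) = k_{\text{base}}(g_\theta(v_i), g_\theta(v_j))$ to minimize the following objective function on triplets:
$$\mathcal{L}_K = \sum_{(v_i, v_j, v_k) \in \mathcal{T}} \ell\big(K_\theta(v_i, v_j),\, K_\theta(v_i, v_k)\big), \tag{9}$$
where $\mathcal{T} \subseteq V \times V \times V$ is a set of node triplets: $v_i$ is an anchor, and $v_j$ is a positive of the same class as the anchor while $v_k$ is a negative of a different class. The loss function $\ell$ is defined as:
$$\ell\big(K_\theta(v_i, v_j),\, K_\theta(v_i, v_k)\big) = \big[K_\theta(v_i, v_k) - K_\theta(v_i, v_j) + \alpha\big]_+ . \tag{10}$$
It ensures that, given two positive nodes of the same class and one negative node, the kernel value of the negative is lower than that of the positive by at least the margin $\alpha$. Here, we present Theorem 3 and its proof to show that minimizing Eq. (9) leads to $K_\theta = s_G$.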
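Theorem 3. If $|K_\theta(v_i, v_j)| \le 1$ for any $v_i, v_j \in V$, minimizing Eq. (9) with $\alpha = 2$ yields $K_\theta = s_G$.
Proof. Let $\mathcal{T}$ be all triplets $(v_i, v_j, v_k)$ satisfying $y_i = y_j$, $y_i \ne y_k$. Suppose that for $\alpha = 2$, the loss in Eq. (10) is zero for all $(v_i, v_j, v_k) \in \mathcal{T}$. It means $K_\theta(v_i, v_k) + 2 \le K_\theta(v_i, v_j)$ for all $(v_i, v_j, v_k) \in \mathcal{T}$. As $|K_\theta| \le 1$, we have $K_\theta(v_i, v_k) = -1$ for all $y_i \ne y_k$ and $K_\theta(v_i, v_j) = 1$ for all $y_i = y_j$. Hence, $K_\theta = s_G$. ∎
We note that $|K_\theta(v_i, v_j)| \le 1$ can be simply achieved by letting $k_{\text{base}}$ be the dot product and normalizing all $g_\theta(v)$ to unit norm. In the following sections, the normalized $K_\theta$ is denoted by $\bar{K}_\theta$.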
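The margin-based triplet criterion and the normalization that bounds the kernel can be sketched as follows. This is a minimal illustration that assumes a dot-product base kernel on unit-normalized embeddings; the two-dimensional embeddings are made up for the example.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(v):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def normalized_kernel(a, b):
    """Dot product of unit-normalised embeddings, so the value lies in [-1, 1]."""
    return dot(normalize(a), normalize(b))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss max(0, K(anchor, negative) - K(anchor, positive) + margin)."""
    return max(
        0.0,
        normalized_kernel(anchor, negative)
        - normalized_kernel(anchor, positive)
        + margin,
    )


anchor, positive = [1.0, 0.0], [2.0, 0.0]

# Negative pointing the opposite way: K = -1, so even the margin 2 is satisfied.
loss_opposite = triplet_loss(anchor, positive, [-3.0, 0.0], margin=2.0)  # 0.0
# Orthogonal negative: K = 0, which violates the margin 2 by exactly 1.
loss_orthogonal = triplet_loss(anchor, positive, [0.0, 1.0], margin=2.0)  # 1.0
```

With the margin set to 2, the loss vanishes only when positives reach similarity 1 and negatives reach similarity -1, which is the situation described in Theorem 3; a smaller margin relaxes this requirement.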
Once the kernel function has learned how to measure the similarity between nodes, we can leverage the output of the feature mapping function as the node representation for node classification. In this paper, we introduce the following two classifiers.
Nearest Centroid Classifier. The nearest centroid classifier extends the $k$-nearest neighbors algorithm by assigning to observations the label of the class of training samples whose centroid is closest to the observation. It does not require additional parameters. To be specific, given a testing node $u$, for all nodes $v_i$ with class label $y_i \in Y$ in the training set, we compute the per-class average similarity between $u$ and $v_i$: $\mu_y = \frac{1}{|V_y|} \sum_{v_i \in V_y} \bar{K}_\theta(u, v_i)$, where $V_y$ is the set of nodes belonging to class $y$. Then the class $y^*$ assigned to the testing node $u$ is:
$$y^* = \arg\max_{y \in Y} \mu_y. \tag{11}$$
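The classification rule in Eq. (11) can be sketched directly: average the kernel value between the test node and the training nodes of each class, then pick the class with the largest average. The normalized dot-product kernel and the toy embeddings below are assumptions for illustration.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalized_kernel(a, b):
    """Dot product of unit-normalised embeddings (cosine similarity)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def nearest_centroid_predict(query, train_embeddings, train_labels):
    """Return the class whose training nodes are most similar to the query on average."""
    totals, counts = {}, {}
    for embedding, label in zip(train_embeddings, train_labels):
        totals[label] = totals.get(label, 0.0) + normalized_kernel(query, embedding)
        counts[label] = counts.get(label, 0) + 1
    return max(totals, key=lambda label: totals[label] / counts[label])


train_embeddings = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
train_labels = ["a", "a", "b", "b"]

predicted = nearest_centroid_predict([0.8, 0.2], train_embeddings, train_labels)
```

No parameters are fitted here: the classifier only reuses the learned kernel, which is why it serves as a direct probe of kernel quality.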
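Softmax Classifier. The idea of the softmax classifier is to reuse the ground truth labels of nodes for training the classifier, so that it can be directly employed for inference. To do this, we add the softmax activation $\sigma(\cdot)$ after $g_\theta(v_i)$ to minimize the following objective:
$$\mathcal{L}_Y = -\sum_{v_i \in V} q(y_i) \log\big(\sigma(g_\theta(v_i))\big), \tag{12}$$
where $q(y_i)$ is the one-hot ground truth vector. Note that Eq. (12) is optimized together with Eq. (9) in an end-to-end manner. Let $\Psi$ denote the corresponding feature mapping function of $K_\theta$, then we have $K_\theta(v_i, v_j) = \langle \Psi(v_i), \Psi(v_j) \rangle = k_{\text{base}}(g_\theta(v_i), g_\theta(v_j))$. In this case, we use the node feature produced by $\Psi$ for classification since $\Psi$ projects node features into the dot-product space, which is a natural metric for similarity comparison. To this end, the feature mapping of $k_{\text{base}}$ is fixed to be the identity function for the softmax classifier, so that we have $\langle \Psi(v_i), \Psi(v_j) \rangle = \langle g_\theta(v_i), g_\theta(v_j) \rangle$ and thus $\Psi(v_i) = g_\theta(v_i)$.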
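The softmax objective for a one-hot target reduces to the negative log-probability of the true class. The sketch below assumes that the output of the feature mapping function is used directly as class logits; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """-log softmax(logits)[label], the loss of one node with a one-hot target."""
    return -math.log(softmax(logits)[label])


def classification_loss(all_logits, labels):
    """Sum the per-node cross-entropy over all labelled nodes."""
    return sum(cross_entropy(logits, y) for logits, y in zip(all_logits, labels))


node_logits = [[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]]
node_labels = [0, 2]
loss = classification_loss(node_logits, node_labels)
```

In the combined setting this term would simply be added to the triplet objective so that both are minimized together.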
Our feature mapping function $g_\theta$ proposed in Eq. (7) has a close connection with Graph Convolutional Networks (GCNs) Kipf and Welling (2017) in the way of capturing node latent representations. In GCNs and most of their variants, each layer leverages the following aggregation rule:
$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big), \tag{13}$$
where $\hat{A}$ is the renormalized adjacency matrix; $W^{(l)}$ is a layer-specific trainable weighting matrix; $\sigma$ denotes an activation function; $H^{(l)}$ denotes the node features in the $l$-th layer, and $H^{(0)} = X_V$. Through stacking multiple layers, GCNs aggregate the features for each node from its $L$-hop neighbors recursively, where $L$ is the network depth. Compared with the proposed $g_\theta$, GCNs actually interleave two basic operations of $g_\theta$: feature transformation and Nadaraya-Watson kernel-weighted average, and repeat them recursively.
We contrast our approach with GCNs in terms of the following aspects. First, our aggregation function is derived from the kernel perspective, which is novel. Second, we show that aggregating features in a recursive manner is inessential. Powerful $h$-hop node representations can be obtained by our model where aggregation is performed only once. As a result, our approach is more efficient both in storage and time when handling very large graphs, since no intermediate states of the network have to be kept. Third, our model is flexible and complementary to GCNs: our function $g_\theta$ can be directly replaced by GCNs and other variants, which can be exploited for future work.
Time and Space Complexity. We assume the number of features $F$ is fixed for all layers and both GCNs and our method have $L$ layers. We count matrix multiplications as in Chiang et al. (2019). GCN’s time complexity is $O(L \|\hat{A}\|_0 F + L N F^2)$, where $\|\hat{A}\|_0$ is the number of nonzeros of $\hat{A}$ and $N$ is the number of nodes in the graph, while ours is $O(\|\hat{A}^h\|_0 F + L N F^2)$, since we do not aggregate features recursively. Here $\|\hat{A}^h\|_0$ is constant in $L$, but $L \|\hat{A}\|_0$ is linear in $L$. For space complexity, GCNs have to store all the feature matrices for recursive aggregation, which needs $O(L N F + L F^2)$ space, where $L F^2$ is for storing trainable parameters of all layers, and thus the first term is linear in $L$. Instead, ours is $O(N F + L F^2)$, where the first term is again constant in $L$. Our experiments indicate that we save 20% (0.3 ms) time and 15% space on the Cora dataset McCallum et al. (2000) compared with GCNs.
We evaluate the proposed kernel-based approach on three benchmark datasets: Cora McCallum et al. (2000), Citeseer Giles et al. (1998) and Pubmed Sen et al. (2008). They are citation networks, where the task of node classification is to classify academic papers of the network (graph) into different subjects. These datasets contain bag-of-words features for each document (node) and citation links between documents.
We compare our approach to five state-of-the-art methods: GCN Kipf and Welling (2017), GAT Veličković et al. (2018), FastGCN Chen et al. (2018), JK Xu et al. (2018) and KLED Fouss et al. (2006). KLED is a kernel-based method, while the others are based on deep neural networks. We test all methods in the supervised learning scenario, where all data in the training set are used for training. We evaluate the proposed method in two different experimental settings according to FastGCN Chen et al. (2018) and JK Xu et al. (2018), respectively. The statistics of the datasets together with their data split settings (i.e., the number of samples contained in the training, validation and testing sets, respectively) are summarized in Table 1. Note that there are more training samples in the data split of JK Xu et al. (2018) than in that of FastGCN Chen et al. (2018). We report the means and standard deviations of node classification accuracy, computed from ten runs, as the evaluation metrics.
Dataset | Nodes | Edges | Classes | Features | Data split of FastGCN Chen et al. (2018) | Data split of JK Xu et al. (2018) |
---|---|---|---|---|---|---|
Cora McCallum et al. (2000) | 2,708 | 5,429 | 7 | 1,433 | 1,208 / 500 / 1,000 | 1,624 / 542 / 542 |
Citeseer Giles et al. (1998) | 3,327 | 4,732 | 6 | 3,703 | 1,827 / 500 / 1,000 | 1,997 / 665 / 665 |
Pubmed Sen et al. (2008) | 19,717 | 44,338 | 3 | 500 | 18,217 / 500 / 1,000 | - |
As we have shown in Section 4.1, there are alternative choices to implement each component of our framework. In this section, we summarize all the variants of our method employed for evaluation.
Choices of the Feature Mapping Function $g$. We implement the feature mapping function $g_\theta$ according to Eq. (7). In addition, we also choose GCN and GAT as the alternative implementations of $g$ for comparison, and denote them by $g_{\text{GCN}}$ and $g_{\text{GAT}}$, respectively.
Choices of the Base Kernel $k_{\text{base}}$. The base kernel has two different implementations: the dot product which is denoted by $k_{\langle \cdot, \cdot \rangle}$, and the RBF kernel which is denoted by $k_{\text{RBF}}$. Note that when the softmax classifier is employed, we set the base kernel to be $k_{\langle \cdot, \cdot \rangle}$.
Choices of the Loss $\mathcal{L}$ and Classifier. We consider the following three combinations of the loss function and classifier. (1) $\mathcal{L}_K$ in Eq. (9) is optimized, and the nearest-centroid classifier is employed for classification. This combination aims to evaluate the effectiveness of the learned kernel. (2) $\mathcal{L}_Y$ in Eq. (12) is optimized, and the softmax classifier is employed for classification. This combination is used in a baseline without kernel methods. (3) Both Eq. (9) and Eq. (12) are optimized, and we denote this loss by $\mathcal{L}_{K+Y}$. The softmax classifier is employed for classification. This combination aims to evaluate how the learned kernel improves the baseline method.
In the experiments, we use $\mathcal{K}$ to denote kernel-based variants and $\mathcal{N}$ to denote ones without the kernel function. All these variants are implemented by MLPs with two layers. Due to the space limitation, we ask the readers to refer to the supplementary material for implementation details.
The means and standard deviations of node classification accuracy (%) following the setting of FastGCN Chen et al. (2018) are organized in Table 2. Our kernel-based variant trained with both losses sets the new state of the art on all datasets. On the Pubmed dataset, all our variants improve on previous methods by a large margin. This demonstrates the effectiveness of employing kernel methods for node classification, especially on datasets with large graphs. Interestingly, our non-kernel baseline even achieves state-of-the-art performance, which shows that our feature mapping function can capture more flexible structural information than previous GCN-based approaches. For the choice of the base kernel, we find that the RBF kernel outperforms the dot product on the two large datasets: Citeseer and Pubmed. We conjecture that when handling complex datasets, a non-linear kernel, e.g., the RBF kernel, is a better choice than the linear kernel.
To evaluate the performance of our feature mapping function, we report the results of two variants in Table 2 that utilize GCN and GAT as the feature mapping function, respectively. As expected, our feature mapping function in Eq. (7) outperforms both on most datasets. This demonstrates that the recursive aggregation schema of GCNs is inessential, since the proposed function aggregates features only in a single step, which is still powerful enough for node classification. On the other hand, it is also observed that both the GCN-based and GAT-based variants outperform their original non-kernel based implementations, which shows that learning with kernels yields better node representations.
Table 3 shows the results following the setting of JK Xu et al. (2018). Note that we do not evaluate on Pubmed in this setup since its corresponding data split for training and evaluation is not provided by Xu et al. (2018). As expected, our method achieves the best performance on all datasets, which is consistent with the results in Table 2. For Cora, the improvement of our method is less pronounced. We conjecture that the results in Table 3 involve more training data due to different data splits, which narrows the performance gap between different methods on datasets with small graphs, such as Cora.
Method | Cora McCallum et al. (2000) | Citeseer Giles et al. (1998) | Pubmed Sen et al. (2008) |
---|---|---|---|
KLED Fouss et al. (2006) | 82.3 | - | 82.3 |
GCN Kipf and Welling (2017) | 86.0 | 77.2 | 86.5 |
GAT Veličković et al. (2018) | 85.6 | 76.9 | 86.2 |
FastGCN Chen et al. (2018) | 85.0 | 77.6 | 88.0 |
Ours | 86.68 ± 0.17 | 77.92 ± 0.25 | 89.22 ± 0.17 |
Ours | 86.12 ± 0.05 | 78.68 ± 0.38 | 89.36 ± 0.21 |
Ours | 88.40 ± 0.24 | 80.28 ± 0.03 | 89.42 ± 0.01 |
Ours | 87.56 ± 0.14 | 79.80 ± 0.03 | 89.24 ± 0.14 |
Ours | 87.04 ± 0.09 | 77.12 ± 0.23 | 87.84 ± 0.12 |
Ours | 86.10 ± 0.33 | 77.92 ± 0.19 | - |
Method | Cora McCallum et al. (2000) | Citeseer Giles et al. (1998) |
---|---|---|
GCN Kipf and Welling (2017) | 88.20 ± 0.70 | 77.30 ± 1.30 |
GAT Veličković et al. (2018) | 87.70 ± 0.30 | 76.20 ± 0.80 |
JK-Concat Xu et al. (2018) | 89.10 ± 1.10 | 78.30 ± 0.80 |
Ours | 89.24 ± 0.31 | 80.78 ± 0.28 |
In Table 4, we implement three groups of variants of our model (2-hop and 2-layer with a learnable weight $\omega_h$ by default) to evaluate the proposed node feature aggregation schema. We answer the following three questions. (1) How does performance change with fewer (or more) hops? We change the number of hops from 1 to 3. The performance improves clearly from one hop to two and stays comparable with three hops, which shows that capturing long-range structures of nodes is important. (2) How many layers of MLP are needed? We show results with different numbers of layers ranging from 1 to 3. The best performance is generally obtained with two layers, while networks tend to overfit the data when more layers are employed. (3) Is it necessary to have a trainable parameter $\omega_h$? We replace $\omega_h$ with a fixed constant and test four values (the last four rows of Table 4). A larger constant improves the performance. However, all results are worse than learning the weighting parameter $\omega_h$, which shows its importance.
Variant | Cora McCallum et al. (2000) | Citeseer Giles et al. (1998) | Pubmed Sen et al. (2008) |
---|---|---|---|
Default | 88.40 ± 0.24 | 80.28 ± 0.03 | 89.42 ± 0.01 |
1-hop | 85.56 ± 0.02 | 77.73 ± 0.02 | 88.98 ± 0.01 |
3-hop | 88.25 ± 0.01 | 80.13 ± 0.01 | 89.53 ± 0.01 |
1-layer | 82.60 ± 0.01 | 77.63 ± 0.01 | 85.80 ± 0.01 |
3-layer | 86.33 ± 0.04 | 78.53 ± 0.20 | 89.46 ± 0.05 |
Fixed constant weight | 69.33 ± 0.09 | 74.48 ± 0.03 | 84.68 ± 0.02 |
Fixed constant weight | 76.98 ± 0.10 | 77.47 ± 0.04 | 86.45 ± 0.01 |
Fixed constant weight | 84.25 ± 0.01 | 77.99 ± 0.01 | 87.45 ± 0.01 |
Fixed constant weight | 87.31 ± 0.01 | 78.57 ± 0.01 | 88.68 ± 0.01 |
We visualize the node embeddings of GCN, GAT and our method on Citeseer with t-SNE. For our method, we use the embedding of the variant that obtains the best performance. Figure 2 illustrates the results. Compared with other methods, our method produces a more compact clustering result. Specifically, our method clusters the “red” points tightly, while in the results of GCN and GAT, they are loosely scattered into other clusters. This is because both GCN and GAT minimize only the classification loss $\mathcal{L}_Y$, targeting accuracy alone. They tend to learn node embeddings driven by those classes with the majority of nodes. In contrast, our best variant is trained with both $\mathcal{L}_K$ and $\mathcal{L}_Y$. Our kernel-based similarity loss encourages data within the same class to be close to each other. As a result, the learned feature mapping function encourages geometrically compact clusters.
Due to the space limitation, we ask the readers to refer to the supplementary material for more experiment results, such as the results of link prediction and visualization on other datasets.
In this paper, we introduce a kernel-based framework for node classification. Motivated by the design of graph kernels, we learn the kernel from ground truth labels by decoupling the kernel function into a base kernel and a learnable feature mapping function. More importantly, we show that our formulation is valid as well as powerful enough to express any p.s.d. kernel. Then the implementation of each component in our approach is extensively discussed. From the perspective of kernel smoothing, we also derive a novel feature mapping function to aggregate features from a node’s neighborhood. Furthermore, we show that our formulation is closely connected with GCNs but more powerful. Experiments on standard node classification benchmarks are conducted to evaluate our approach. The results show that our method outperforms the state of the art.
This work is funded by ARO-MURI-68985NSMUR and NSF 1763523, 1747778, 1733843, 1703883.
We use different network settings for the combinations of the loss function and inference method in Section 6.1 of the original paper. For Variant (1), we choose the output dimension of the first and second layers to be 512 and 128, respectively. We train this combination for 10 epochs on Cora and Citeseer and for 100 epochs on Pubmed.
For GAT Veličković et al. (2018), due to its large memory cost, the output dimension of its first and second layers is chosen to be 64 and 8, respectively.
For Variants (2) and (3), the output dimension of the first layer is chosen to be 16. The output dimension of the second layer is the same as the number of node classes. We train this combination for 100 epochs for GAT and 200 epochs for other setups.
In Eq. (9) of the original paper, we randomly sample 10,000 triplets in each epoch. In Eq. (10) of the original paper, the margin $\alpha$ is set to be 0.1 for all datasets. All methods are optimized using Adam Kingma and Ba (2014) with a learning rate of 0.01. We use the best model achieved on the validation set for testing. Each result is reported based on an average over 10 runs.
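Random triplet sampling of the kind mentioned above can be sketched as follows. The sampling scheme is an assumption made for illustration (a class is drawn uniformly, then an anchor and a positive from it, then a negative from another class); the text does not specify how the triplets are drawn.

```python
import random


def sample_triplets(labels, num_triplets, seed=0):
    """Draw (anchor, positive, negative) node-index triplets from class labels."""
    rng = random.Random(seed)
    by_class = {}
    for node, label in enumerate(labels):
        by_class.setdefault(label, []).append(node)
    # Anchors need a distinct positive, so their class must have at least two nodes.
    anchor_classes = [c for c, members in by_class.items() if len(members) >= 2]
    triplets = []
    for _ in range(num_triplets):
        anchor_class = rng.choice(anchor_classes)
        anchor, positive = rng.sample(by_class[anchor_class], 2)
        negative_class = rng.choice([c for c in by_class if c != anchor_class])
        negative = rng.choice(by_class[negative_class])
        triplets.append((anchor, positive, negative))
    return triplets


labels = [0, 0, 1, 1, 2]  # toy labels for five training nodes
triplets = sample_triplets(labels, num_triplets=50, seed=3)
```

Calling the function once per epoch with a new seed gives a fresh batch of triplets each time; with the settings reported above, `num_triplets` would be 10,000.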
In addition to node classification, we also conduct experiments for link prediction to demonstrate the generalizability of the proposed framework in different graph-based tasks. We train the models using an incomplete version of the three citation datasets (Cora, Citeseer and Pubmed) according to Kipf and Welling (2016): the node features remain but parts of the citation links (edges) are missing. The validation and test sets are constructed following the setup of Kipf and Welling (2016).
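We choose $k_{\text{base}}$ to be the dot product and set $g_\theta$ to be the feature mapping function. Given graph $G = (V, E)$, for $v_i, v_j \in V$, the similarity measure is defined as:
$$s_G(v_i, v_j) = \begin{cases} 1 & \text{if } (v_i, v_j) \in E, \\ 0 & \text{otherwise}. \end{cases} \tag{14}$$
The feature mapping function $g_\theta$ can be learned by minimizing the following objective function in a data-driven manner:
$$\mathcal{L} = \sum_{(v_i, v_j) \in \mathcal{D}} \ell\big(k_{\text{base}}(g_\theta(v_i), g_\theta(v_j)),\, s_G(v_i, v_j)\big), \tag{15}$$
where $\mathcal{D}$ is the set of training edges, and $\ell$ is the binary cross entropy loss.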
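A per-pair link prediction loss in this spirit can be sketched as below. Passing the dot product through a sigmoid before the binary cross entropy is an assumption borrowed from common graph autoencoder practice; the text does not state how the kernel value is turned into a probability.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def link_bce(embedding_i, embedding_j, is_edge):
    """Binary cross entropy between sigmoid(<g(v_i), g(v_j)>) and the edge label."""
    probability = sigmoid(dot(embedding_i, embedding_j))
    return -math.log(probability) if is_edge else -math.log(1.0 - probability)


aligned = link_bce([2.0, 1.0], [2.0, 1.0], is_edge=True)      # confident and correct
mismatch = link_bce([2.0, 1.0], [2.0, 1.0], is_edge=False)    # confident but wrong
uncertain = link_bce([1.0, 0.0], [0.0, 1.0], is_edge=True)    # probability 0.5
```

Summing this quantity over the training pairs gives the objective; node pairs whose embeddings align are pushed toward being predicted as edges.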
Table 5 summarizes the link prediction results of our kernel-based method, the variational graph autoencoder (VGAE) Kipf and Welling (2016) and its non-probabilistic variant (GAE). Our kernel-based method is highly comparable with these state-of-the-art methods, showing the potential of applying the proposed framework in different applications on graphs.
Method | Cora AUC | Cora AP | Citeseer AUC | Citeseer AP | Pubmed AUC | Pubmed AP |
---|---|---|---|---|---|---|
GAE Kipf and Welling (2016) | 91.0 ± 0.02 | 92.0 ± 0.03 | 89.5 ± 0.04 | 89.9 ± 0.05 | 96.4 ± 0.00 | 96.5 ± 0.00 |
VGAE Kipf and Welling (2016) | 91.4 ± 0.01 | 92.6 ± 0.01 | 90.8 ± 0.02 | 92.0 ± 0.02 | 94.4 ± 0.02 | 94.7 ± 0.02 |
Ours | 93.1 ± 0.06 | 93.2 ± 0.07 | 90.9 ± 0.08 | 91.8 ± 0.04 | 94.5 ± 0.03 | 94.2 ± 0.01 |
We visualize the node embeddings of GCN Kipf and Welling (2017), GAT Veličković et al. (2018) and our method on Cora with t-SNE in Fig. 3. Our method produces tight and clear clustering embeddings (especially for the “red” points and “violet” points), which shows that compared with GCN and GAT, our method is able to learn more reasonable feature embeddings for nodes.
Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 488–495.
Proceedings of the European Conference on Computer Vision (ECCV), pp. 339–354.
Revisiting semi-supervised learning with graph embeddings. In Proceedings of the International Conference on Machine Learning (ICML), pp. 40–48.
Retgk: graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3964–3974.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3425–3435.