Code and data for the paper `Bayesian Semi-supervised Learning with Graph Gaussian Processes'
We propose a data-efficient Gaussian process-based Bayesian approach to the semi-supervised learning problem on graphs. The proposed model shows extremely competitive performance when compared to the state-of-the-art graph neural networks on semi-supervised learning benchmark experiments, and outperforms the neural networks in active learning experiments where labels are scarce. Furthermore, the model does not require a validation data set for early stopping to control over-fitting. Our model can be viewed as an instance of empirical distribution regression weighted locally by network connectivity. We further motivate the intuitive construction of the model with a Bayesian linear model interpretation where the node features are filtered by an operator related to the graph Laplacian. The method can be easily implemented by adapting off-the-shelf scalable variational inference algorithms for Gaussian processes.READ FULL TEXT VIEW PDF
We develop fast algorithms for solving the variational and game-theoreti...
In this work we discuss the problem of active learning. We present an
Graph convolutional neural networks (GCNs) have recently demonstrated
Link prediction aims to reveal missing edges in a graph. We address this...
Propagating input uncertainty through non-linear Gaussian process (GP)
Hypergraphs are a common model for multiway relationships in data, and
A KL-divergence objective of the joint distribution of data and labels a...
Code and data for the paper `Bayesian Semi-supervised Learning with Graph Gaussian Processes'
Data sets with network and graph structures that describe the relationships between the data points (nodes) are abundant in the real world. Examples of such data sets include friendship graphs on social networks, citation networks of academic papers, web graphs and many others. The relational graphs often provide rich information in addition to the node features that can be exploited to build better predictive models of the node labels, which can be costly to collect. In scenarios where there are not enough resources to collect sufficient labels, it is important to design data-efficient models that can generalize well with few training labels. The class of learning problems where a relational graph of the data points is available is referred to as graph-based semi-supervised learning in the literature (chapelle2009semi, ; zhu2005semi, ).
Many of the successful graph-based semi-supervised learning models are based on graph Laplacian regularization or learning embeddings of the nodes. While these models have been widely adopted, their predictive performance leaves room for improvement. More recently, powerful graph neural networks that surpass Laplacian and embedding based methods in predictive performance have become popular. However, neural network models require relatively larger number of labels to prevent over-fitting and work well. We discuss the existing models for graph-based semi-supervised learning in detail in Section 4.
We propose a new Gaussian process model for graph-based semi-supervised learning problems that can generalize well with few labels, bridging the gap between the simpler models and the more data intensive graph neural networks. The proposed model is also competitive with graph neural networks in settings where there are sufficient labelled data. While posterior inference for the proposed model is intractable for classification problems, scalable variational inducing point approximation method for Gaussian processes can be directly applied to perform inference. Despite the potentially large number of inducing points that need to be optimized, the model is protected from over-fitting by the variational lower bound, and does not require a validation data set for early stopping. We refer to the proposed model as the graph Gaussian process (GGP).
In this section, we briefly review key concepts in Gaussian processes and the relevant variational approximation technique. Additionally, we review the graph Laplacian, which is relevant to the alternative view of the model that we describe in Section 3.1. This section also introduces the notation used across the paper.
A Gaussian processand covariance kernel function , where denote the possible inputs that index the GP and is a set of hyper-parameters parameterizing the kernel function. We denote the GP as follows
GPs are widely used as priors on functions in the Bayesian machine learning literatures because of their wide support, posterior consistency, tractable posterior in certain settings and many other good properties. Combined with a suitable likelihood function as specified in Equation2, one can construct a regression or classification model that probabilistically accounts for uncertainties and control over-fitting through Bayesian smoothing. However, if the likelihood is non-Gaussian, such as in the case of classification, inferring the posterior process is analytically intractable and requires approximations. The GP is connected to the observed data via the likelihood function
The positive definite kernel function is a key component of GP that specifies the covariance of a priori. While is typically directly specified, any kernel function can be expressed as the inner product of features maps in the Hilbert space . The dependency of the feature map on is implicitly assumed for conciseness. The feature map projects into a typically high-dimensional (possibly infinite) feature space such that linear models in the feature space can model the target variable effectively. Therefore, GP can equivalently be formulated as
where is assigned a multivariate Gaussian prior distribution and marginalized. In this paper, we assume the index set to be without loss of generality.
For a detailed review of the GP and the kernel functions, please refer to (williams2006gaussian, ).
Despite the flexibility of the GP prior, there are two major drawbacks that plague the model. First, if the likelihood function in Equation 2 is non-Gaussian, posterior inference cannot be computed analytically. Secondly, the computational complexity of the inference algorithm is where is the number of training data points, rendering the model inapplicable to large data sets.
Fortunately, modern variational inference provides a solution to both problems by introducing a set of inducing points , where . The inducing points, which are variational parameters, index a set of random variables that is a subset of the GP function . Through conditioning and assuming is zero, the conditional GP can be expressed as
where and . Naturally, . The variational posterior distribution of , is assumed to be a multivariate Gaussian distribution with mean and covariance matrix . Following the standard derivation of variational inference, the Evidence Lower Bound (ELBO) objective function is
The variational distribution can be easily derived from the conditional GP in Equation 4 and , and its expectation can be approximated effectively using 1-dimensional quadratures. We refer the readers to (matthews2016scalable, ) for detailed derivations and results.
Given adjacency matrix of an undirected binary graph without self-loop, the corresponding graph Laplacian is defined as
where is the diagonal node degree matrix. The graph Laplacian can be viewed as an operator on the space of functions indexed by the graph’s nodes such that
where is the set containing neighbours of node . Intuitively, applying the Laplacian operator to the function results in a function that quantifies the variability of around the nodes in the graph.
The Laplacian’s spectrum encodes the geometric properties of the graph that are useful in crafting graph filters and kernels (shuman2013emerging, ; vishwanathan2010graph, ; bronstein2017geometric, ; chung1997spectral, ). As the Laplacian matrix is real symmetric and diagonalizable, its eigen-decomposition exists. We denote the decomposition as
where the columns of
are the eigenfunctions ofand the diagonal
contains the corresponding eigenvalues. Therefore, the Laplacian operator can also be viewed as a filter on functionre-expressed using the eigenfunction basis. Regularization can be achieved by directly manipulating the eigenvalues of the system (smola2003kernels, ). We refer the readers to (bronstein2017geometric, ; shuman2013emerging, ; chung1997spectral, ) for comprehensive reviews of the graph Laplacian and its spectrum.
Given a data set of size with -dimensional features , a symmetric binary adjacency matrix that represents the relational graph of the data points and labels for a subset of the data points, , with each , we seek to predict the unobserved labels of the remaining data points . We denote the set of all labels as .
The GGP specifies the conditional distribution , and predicts via the predictive distribution . The joint model is specified as the product of the conditionally independent likelihood and the GGP prior with hyper-parameters
. The latent likelihood parameter vectoris defined in the next paragraph.
First, the model factorizes as
where for the multi-class classification problem that we are interested in, is given by the robust-max likelihood (matthews2016scalable, ; girolami2006variational, ; kim2006bayesian, ; hernandez2011robust, ; hensman2015mcmc, ).
Next, we construct the GGP prior from a Gaussian process distributed latent function , , where the key assumption is that the likelihood parameter for data point is an average of the values of over its 1-hop neighbourhood as given by :
where , . We further motivate this key assumption in Section 3.1.
As has a zero mean function, the GGP prior can be succinctly expressed as a multivariate Gaussian random field
where and . A suitable kernel function for the task at hand can be chosen from the suite of well-studied existing kernels, such as those described in (duvenaud2014automatic, ). We refer to the chosen kernel function as the base kernel of the GGP. The matrix is sometimes known as the random-walk matrix in the literatures (chung1997spectral, ). A graphical model representation of the proposed model is shown in Figure 1.
The covariance structure specified in Equation 11 is equivalent to the pairwise covariance
where is the feature map that corresponds to the base kernel . Equation 12 can be viewed as the inner product between the empirical kernel mean embeddings that correspond to the bags of node features observed in the 1-hop neighborhood sub-graphs of node and , relating the proposed model to the Gaussian process distribution regression model presented in e.g. (flaxman2015supported, ).
More specifically, we can view the GGP as a distribution classification model for the labelled bags of node features , such that the unobserved distribution that generates is summarized by its empirical kernel mean embedding
The prior on can equivalently be expressed as . For detailed reviews of the kernel mean embedding and distribution regression models, we refer the readers to (muandet2017kernel, ) and (szabo2016learning, ) respectively.
One main assumption of the 1-hop neighbourhood averaging mechanism is homophily - i.e., nodes with similar covariates are more likely to form connections with each others (goldenberg2010survey, ). The assumption allows us to approximately treat the node covariates from a 1-hop neighbourhood as samples drawn from the same data distribution, in order to model them using distribution regression. While it is perfectly reasonable to consider multi-hops neighbourhood averaging, the homophily assumption starts to break down if we consider 2-hop neighbours which are not directly connected. Nevertheless, it is interesting to explore non-naive ways to account for multi-hop neighbours in the future, such as stacking 1-hop averaging graph GPs in a structure similar to that of the deep Gaussian processes (damianou2013deep, ; salimbeni2017doubly, ), or having multiple latent GPs for neighbours of different hops that are summed up in the likelihood functions.
In this section, we present an alternative formulation of the GGP, which results in an intuitive interpretation of the model. The alternative formulation views the GGP as a Bayesian linear model on feature maps of the nodes that have been transformed by a function related to the graph Laplacian .
As we reviewed in Section 2.1, the kernel matrix in Equation 11 can be written as the product of feature map matrix where row of corresponds to the feature maps of node , . Therefore, the covariance matrix in Equation 11, , can be viewed as the product of the transformed feature maps
where is the graph Laplacian matrix as defined in Equation 6. Isolating the transformed feature maps for node (i.e., row of ) gives
where is the degree of node and denotes row of a matrix. The proposed GGP model is equivalent to a supervised Bayesian linear classification model with a feature pre-processing step that follows from the expression in Equation 15. For isolated nodes , the expression in Equation 15 leaves the node feature maps unchanged ().
The term in Equation 15 can be viewed as a spectral filter , where and are the eigenmatrix and eigenvalues of the Laplacian as defined in Section 2.2. For connected nodes, the expression results in new features that are weighted averages of the original features and features transformed by the spectral filter. The alternative formulation opens up opportunities to design other spectral filters with different regularization properties, such as those described in (smola2003kernels, ), that can replace the expression in Equation 15. We leave the exploration of this research direction to future work.
In addition, it is well-known that many graphs and networks observed in the real world follow the power-law node degree distributions (goldenberg2010survey, ), implying that there are a handful of nodes with very large degrees (known as hubs) and many with relatively small numbers of connections. The nodes with few connections (small ) are likely to be connected to one of the handful of heavily connected nodes, and their transformed node feature maps are highly influenced by the features of the hub nodes. On the other hand, individual neighbours of the hub nodes have relatively small impact on the hub nodes because of the large number of neighbours that the hubs are connected to. This highlights the asymmetric outsize influence of hubs in the proposed GGP model, such that a mis-labelled hub node may result in a more significant drop in the model’s accuracy compared to a mis-labelled node with much lower degree of connections.
Posterior inference for the GGP is analytically intractable because of the non-conjugate likelihood. We approximate the posterior of the GGP using a variational inference algorithm with inducing points similar to the inter-domain inference algorithm presented in (van2017convolutional, ). Implementing the GGP with its variational inference algorithm amounts to implementing a new kernel function that follows Equation 12 in the GPflow Python package.111https://github.com/markvdw/GPflow-inter-domain
We introduce a set of inducing random variables indexed by inducing points in the same domain as the GP function . As a result, the inter-domain covariance between and is
Additionally, we introduce a multivariate Gaussian variational distribution for the inducing random variables with variational parameters and the lower triangular . Through Gaussian conditioning, results in the variational Gaussian distribution that is of our interest. The variational parameters and the kernel hyper-parameters are then jointly fitted by maximizing the ELBO function in Equation 5.
The computational complexity of the inference algorithm is . In the experiments, we chose to be the number of labelled nodes in the graph , which is small relative to the total number of nodes. Computing the covariance function in Equation 12 incurs a computational cost of per labelled node, where is the maximum node degree. In practice, the computational cost of computing the covariance function is small because of the sparse property of graphs typically observed in the real-world (goldenberg2010survey, ).
Graph-based learning problems have been studied extensively by researchers from both machine learning and signal processing communities, leading to many models and algorithms that are well-summarized in review papers (bronstein2017geometric, ; sandryhaila2014big, ; shuman2013emerging, ).
Gaussian process-based models that operate on graphs have previously been developed in the closely related relational learning discipline, resulting in the mixed graph Gaussian process (XGP) (silva2008hidden, ) and relational Gaussian process (RGP) (chu2007relational, ). Additionally, the renowned Label Propagation (LP)(zhu2003semi, ) model can also be viewed as a GP with its covariance structure specified by the graph Laplacian matrix (zhu2003semigp, ). The GGP differs from the previously proposed GP models in that the local neighbourhood structures of the graph and the node features are directly used in the specification of the covariance function, resulting in a simple model that is highly effective.
Models based on Laplacian regularization that restrict the node labels to vary smoothly over graphs have also been proposed previously. The LP model can be viewed as an instance under this framework. Other Laplacian regularization based models include the deep semi-supervised embedding (weston2012deep, ) and the manifold regularization (belkin2006manifold, ) models. As shown in the experimental results in Table 1, the predictive performance of these models fall short of other more sophisticated models.
Additionally, models that extract embeddings of nodes and local sub-graphs which can be used for predictions have also been proposed by multiple authors. These models include DeepWalk (perozzi2014deepwalk, ), node2vec (grover2016node2vec, ), planetoid (Yang, ) and many others. The proposed GGP is related to the embedding based models in that it can be viewed as a GP classifer that takes empirical kernel mean embeddings extracted from the 1-hop neighbourhood sub-graphs as inputs to predict node labels.
Finally, many geometric deep learning models that operate on graphs have been proposed and shown to be successful in graph-based semi-supervised learning problems. The earlier models including(li2015gated, ; scarselli2009graph, ; gori2005new, )
are inspired by the recurrent neural networks. On the other hand, convolution neural networks that learn convolutional filters in the graph Laplacian spectral domain have been demonstrated to perform well. These models include the spectral CNN(bruna2013spectral, ), DCNN (atwood2016diffusion, ), ChebNet (defferrard2016convolutional, ) and GCN (kipf2016semi, ). Neural networks that operate on the graph spectral domain are limited by the graph-specific Fourier basis. The more recently proposed MoNet (monti2017geometric, ) addressed the graph-specific limitation of spectral graph neural networks. The idea of filtering in graph spectral domain is a powerful one that has also been explored in the kernel literatures (smola2003kernels, ; vishwanathan2010graph, ). We draw parallels between our proposed model and the spectral filtering approaches in Section 3.1
, where we view the GGP as a standard GP classifier operating on feature maps that have been transformed through a filter that can be related to the graph spectral domain.
We present two sets of experiments to benchmark the predictive performance of the GGP against existing models under two different settings. In Section 5.1, we demonstrate that the GGP is a viable and extremely competitive alternative to the graph convolutional neural network (GCN) in settings where there are sufficient labelled data points. In Section 5.2, we test the models in an active learning experimental setup, and show that the GGP outperforms the baseline models when there are few training labels.
The semi-supervised classification experiments in this section exactly replicate the experimental setup in kipf2016semi , where the GCN is known to perform well. The three benchmark data sets, as described in Table 2, are citation networks with bag-of-words (BOW) features, and the prediction targets are the topics of the scientific papers in the citation networks.
The experimental results are presented in Table 1, and show that the predictive performance of the proposed GGP is competitive with the GCN and MoNet (monti2017geometric, ) (another deep learning model), and superior to the other baseline models. While the GCN outperforms the proposed model by small margins on the test sets with data points, it is important to note that the GCN had access to 500 additional labelled data points for early stopping. As the GGP does not require early stopping, the additional labelled data points can instead be directly used to train the model to significantly improve the predictive performance. To demonstrate this advantage, we report another set of results for a GGP trained using the 500 additional data points in Table 1, in the row labelled as ‘GGP-X’. The boost in the predictive performances shows that the GGP can better exploit the available labelled data to make predictions.
The GGP base kernel of choice is the 3rd degree polynomial kernel, which is known to work well with high-dimensional BOW features (williams2006gaussian, ). We re-weighed the BOW features using the popular term frequency-inverse document frequency (TFIDF) technique (sparck1972statistical, ). The variational parameters and the hyper-parameters were jointly optimized using the ADAM optimizer (kingma2014adam, ). The baseline models that we compared to are the ones that were also presented and compared to in (kipf2016semi, ) and (monti2017geometric, ).
Active learning is a domain that faces the same challenges as semi-supervised learning where labels are scarce and expensive to obtain (zhu2005semi, ). In active learning, a subset of unlabelled data points are selected sequentially to be queried according to an acquisition function, with the goal of maximizing the accuracy of the predictive model using significantly fewer labels than would be required if the labelled set were sampled uniformly at random (balcan2010true, ). A motivating example of this problem scenario is in the medical setting where the time of human experts is precious, and the machines must aim to make the best use of the time. Therefore, having a data efficient predictive model that can generalize well with few labels is of critical importance in addition to having a good acquisition function.
In this section, we leverage GGP as the semi-supervised classification model of active learner in graph-based active learning problem (zhu2005semi, ; ma2013sigma, ; dasarathy2015s2, ; jun2016graph, ; MacAodhaCVPR2014, ). The GGP is paired with the proven -optimal (SOPT) acquisition function to form an active learner (ma2013sigma, ). The SOPT acquisition function is model agnostic in that it only requires the Laplacian matrix of the observed graph and the indices of the labelled nodes in order to identify the next node to query, such that the predictive accuracy of the active learner is maximally increased. The main goal of the active learning experiments is to demonstrate that the GGP can learn better than both the GCN and the Label Propagation model (LP) (zhu2003semi, ) with very few labelled data points.
Starting with only 1 randomly selected labelled data point (i.e., node), the active learner identifies the next data point to be labelled using the acquisition function. Once the label of the said data point is acquired, the classification model is retrained and its test accuracy is evaluated on the remaining unlabelled data points. In our experiments, the process is repeated until 50 labels are acquired. The experiments are also repeated with 10 different initial labelled data points. In addition to the SOPT acquisition function, we show the results of the same models paired with the random acquisition function (RAND) for comparisons.
The test accuracies with different numbers of labelled data points are presented as learning curves in Figure 2. In addition, we summarize the results numerically using the Area under the Learning Curve (ALC) metric in Table 3. The ALC is normalized to have a maximum value of 1, which corresponds to a hypothetical learner that can achieve test accuracy with only label. The results show that the proposed GGP model is indeed more data efficient than the baselines and can outperform both the GCN and the LP models when labelled data are scarce.
The benchmark data sets for the active learning experiments are the Cora and Citeseer data sets. However, due to technical restriction imposed by the SOPT acquisition function, only the largest connected sub-graph of the data set is used. The restriction reduces the number of nodes in the Cora and Citeseer data sets to and respectively. Both of the data sets were also used as benchmark data sets in (ma2013sigma, ).
We pre-process the BOW features with TFIDF and apply a linear kernel as the base kernel of the GGP. All parameters are jointly optimized using the ADAM optimizer. The GCN and LP models are trained using the settings recommended in (kipf2016semi, ) and (ma2013sigma, ) respectively.
We propose a Gaussian process model that is data-efficient for semi-supervised learning problems on graphs. In the experiments, we show that the proposed model is competitive with the state-of-the-art deep learning models, and outperforms when the number of labels is small. The proposed model is simple, effective and can leverage modern scalable variational inference algorithm for GP with minimal modification. In addition, the construction of our model is motivated by distribution regression using the empirical kernel mean embeddings, and can also be viewed under the framework of filtering in the graph spectrum. The spectral view offers a new potential research direction that can be explored in future work.
This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.
The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains.IEEE Signal Processing Magazine, 30(3):83–98, 2013.