 # Gaussian Processes on Graphs via Spectral Kernel Learning

We propose a graph spectrum-based Gaussian process for prediction of signals defined on nodes of the graph. The model is designed to capture various graph signal structures through a highly adaptive kernel that incorporates a flexible polynomial function in the graph spectral domain. Unlike most existing approaches, we propose to learn such a spectral kernel, where the polynomial setup enables learning without the need for eigen-decomposition of the graph Laplacian. In addition, this kernel has the interpretability of graph filtering achieved by a bespoke maximum likelihood learning algorithm that enforces the positivity of the spectrum. We demonstrate the interpretability of the model in synthetic experiments from which we show the various ground truth spectral filters can be accurately recovered, and the adaptability translates to superior performances in the prediction of real-world graph data of various characteristics.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Graphs are highly useful data structures that represent relationships and interactions between entities. Such relational structures are commonly observed in the real-world, but can also be artificially constructed from data according to heuristics. The graph structure can be exploited in conjunction with other auxiliary data to build more powerful predictive models. One particular class of models that can be enhanced for graph data is Gaussian processes (GP). As a kernel method, GPs can be adapted to incorporate topological information through kernels derived on graphs. With the kernel defined, the standard Bayesian inference machinery can be directly applied to yield predictions.

Multi-output Gaussian processes (MOGP) are regression models for vector-valued data. Given a set of input covariates and the corresponding output vectors, the model makes vectorial predictions given a novel input covariate. In graph signal prediction problems, each output signal, indexed by a corresponding input, can be viewed as a vector where the dependency between elements is encoded in the graph structure. The dependency between the signals can then be modeled using a kernel on the inputs (e.g., the squared exponential kernel). The formulation of separable kernels for MOGP, as is the case in co-regionalization model in

alvarez2012kernels , makes choosing the overall kernel function straight forward. The two kernels operating on the inputs and outputs can be designed separately and combined by means of a Kronecker product. We refer to the kernel operating on the input as kernel on the input space, and the kernel operating on the signals as kernel on the output space, where the latter provides a measure of smoothness between data observed on the nodes.

Smola et al smola2003kernels

have introduced the notion of kernel on graphs, where kernel functions between nodes were derived from a regularization perspective by specifying a function in the graph spectral domain. The resulting kernel is based on the graph Laplacian, and this is closely related to graph signal processing, which makes use of tools such as graph Fourier transform and filtering

shuman2012emerging ; ortega2018graph ; chen2014signal . One particular low-pass filter defined in shuman2012emerging , commonly used to de-noise graph signals, also assumes the form of kernels on graphs. This was subsequently used in venkitaraman2018gaussian ; venkitaraman2017kernel for predicting low pass graph signals. However, not all data exhibits a low-pass spectrum, and this filter will not work well on data of band- and high-pass spectrums. The same limitation also applies to other existing GP models developed for graph-structured data such as ng2018bayesian , where the relationship between the node observations is defined a priori. Addressing this limitation requires a different choice of kernels with a spectrum that better compliments the characteristics of the data.

Learning kernels in the spectral domain have been studied in the continuous case such as Wilson2013GaussianPK ; samo2015generalized ; li2019automated , but the extension of the approach to learning on a discrete graph space has yet to be explored with the use of a GP. In this paper, we propose a MOGP model that uses a kernel on graphs on the output space, with the addition that the spectrum of the kernel is learnt. The model is designed to capture various graph signal structures by incorporating a flexible polynomial function in the graph spectral domain, producing a highly adaptable model. The polynomial function is learnt by maximizing the log-marginal likelihood while respecting a constraint to enforce the positivity of the spectrum. The positivity constraint allows for a meaningful interpretation of the learnt models as graph filters, giving the modelers insights on the characteristics of the data. Lastly, we demonstrate that our algorithm can recover ground truth filters applied to synthetic data, and show the adaptability of the model on real-world data with different spectral characteristics.

## 2 Background

### 2.1 Gaussian Processes

A GP is defined as

 f(x)∼GP(m(x),K(x,x′)) (1)

for any inputs , where is the mean function, and

is the symmetric and positive definite kernel function. In machine learning, GPs are widely employed models for making predictions with uncertainty in both regression and classification settings. We will refer readers to

rasmussen2003gaussian for a more thorough description of the GP model.

### 2.2 Spectral Filtering on Graphs

We define the notion of spectral filtering on graphs from the graph Laplacian shuman2012emerging defined as , where is the adjacency matrix and is the diagonal degree matrix of a given graph . Assuming that is undirected, the Laplacian admits the eigen-decomposition where

contains the eigenvectors and

is the diagonal matrix of eigenvalues. For a graph signal

, the graph Fourier transform defined as computes the spectrum of to produce the amplitude of each eigenvector (frequency) component. Filtering then involves a function in the graph spectral domain that may reduce or amplify each component leading to a filtered signal

 Ug(Λ)U⊤y. (2)

The term is therefore referred to as a filtering function on the graph characterized by .

### 2.3 Kernels and Regularization on Graphs

A property of kernel functions is provided by Bochner’s theorem loomis2013introduction , which states that positive definite functions have non-negative measures as the spectrum in the spectral domain. On the discrete graph space kernels are derived by the graph Fourier transform and a non-negative transfer function. In this section we make use of the general formulation of kernel on graphs described in smola2003kernels .

The graph Laplacian can be used to quantify the smoothness of a function on graphs by measuring how much they vary locally smola1998regularization ; smola2001regularization . When finding a smooth model for graph signal , it is common to solve for the following regularized problem

 minf||f−y||22+R(f), (3)

where we have the regularization function on . In the graph case, where often takes the form of a function of the graph Laplacian that penalizes specific graph spectral components of . The kernel function is then computed by , with pseudoinverse used if is singular bozzo2013moore . More generally, kernels on graphs assume the following form

 M∑i=1r−1(λi)viv⊤i=Ur−1(Λ)U⊤=r−1(L) (4)

for the eigenvalues and eigenvectors of the graph Laplacian, and penalty function . Furthermore, this definition is flexible and using other versions such as the normalized Laplacian and scaled Laplacian will both lead to valid kernels.

## 3 Proposed Model

### 3.1 Gaussian Processes for Graph Signals

From a generative model perspective, the observed signal on a graph with nodes can be viewed as a realization of a filtering system where is the graph filter, and is a simple MOGP function with independent components evaluated at input . The elements in are assumed to be independent GPs with identical kernel function on inputs and . This leads to , where

is an identity matrix. Cross covariance and graph structures in

are therefore induced by the filtering matrix , giving rise to the following model

 yn=Bf(xn)+ϵn, (5)

where . The model in Eq. (5) is generic in the sense that, depending on the design of , we can incorporate any characteristics of the signal in the graph spectral domain.

The prior covariance between two signals and can be computed as , and if we let , the covariance of the full data becomes

 \textmdCov(~y)=K⊗BB⊤+σ2ϵIMN, (6)

where , and denotes the Kronecker product. The term can be thought of as a kernel between elements of each outputs , while operates on signals’ inputs and . Generally, will be referred to as the kernel on the input space, while we will call the graph dimension or the kernel on the output space.

We now state our main model for prediction of graph signals. Given the GP prior on the latent , the training signals and test signal with given input

 (7)

For the model covariance, and can be any existing kernel such as the squared exponential or Matérn kernel. In this work, we consider as a kernel on graphs based on the scaled graph Laplacian, and follow the general form (4) as , where and now correspond to the eigenvalues and eigenvectors of , and is the function of in the graph spectral domain. The resulting can be interpreted as both a filter and a kernel on the graph.

### 3.2 Graph Spectral Kernel Learning

To further enhance our model, we propose to learn the function rather than using an existing kernel on graphs to make the model adaptive. We parameterize as a finite polynomial function

 g(λ)=β0+β1λ+⋯+βPλP⟹B=P∑i=0βiLiS, (8)

with coefficients learnt via log-marginal likelihood maximization. We will use gradient optimization to learn these coefficients which we will go into more details in the next section. A suitable choice for the degree can be found via a validation procedure; in practice, we find that a choice of often leads to satisfactory performances.

There are a number of advantages of our model setup described above, in particular:

• [noitemsep]

• The kernel on graphs is learnt rather than chosen a priori, and the function that characterizes the kernel is a flexible polynomial making the model highly adaptable to data with different spectral properties. Moreover, existing choices provided in smola2003kernels all consist of functions that have polynomial expansions. Hence our model provides suitable approximations if data came from a more complex generative model.

• The scaled Laplacian ensures the eigenvalues lie in the full range regardless of the graph. This bounds the values in the polynomial, and helps prevent the range of the eigenvalues from affecting the shape of . Other alternatives such as the normalized Laplacian often found in the literature of graph signal processing shuman2012emerging bounds the eigenvalues to be within and, by subtracting the identity matrix, shifts the eigenvalues to the range . However, we find that often the eigenvalues are not spread over the full range , thus the polynomial is only defined partially over the range.

• The application of the power of the Laplacian corresponds to filtering restricted to the -hop neighbourhood of the nodes. Our polynomial is finite, thus the user can control the localization in the kernel, a property that is often desirable in models such as the GCN kipf2016semi .

• The polynomial setup leads to several computational savings: the kernel spectrum is defined as a function of directly, hence eigen-decomposition of is not required; additionally, the linearity of means differentiating is efficient for gradient optimization.

### 3.3 Equivalence to the Co-regionalization Model

The prior model in (7) follows the form of separable kernels similar to the co-regionalization model in the literature of multi-output GP alvarez2012kernels . Our derivation specify the kernel on the output space more directly, but in this section we show how the two are equivalent. Starting with the model for a GP function , under the setup of intrinsic co-regionalization model (ICM) alvarez2012kernels , we have

 f(xn)=S∑i=1biui(xn) (9)

where are i.i.d. variables following and for all . This leads to a model whose covariance is

 S∑i=1S∑j=1bib⊤jE(ui(xn)uj(xm))=S∑i=1bib⊤iE(ui(xn)ui(xm))=K(xn,xm)S∑i=1bib⊤i. (10)

Denoting , we can see that , thus the covariance can be written as . When we have intput-output data pairs, the full covariance of will follow the separable form . Since a kernel on graphs is usually a square matrix, our graph GP model is equivalent to ICM if and the vectors combine into a matrix that takes the general form of Eq. (4).

As an additional note, the covariance we derive is dependent on the manner in which are stacked into a single vector. If we take instead, we will get the covariance . These are simply different ways to represent the prior covariance, and and still operate on the output and input space respectively.

## 4 Optimizing GP Log-Marginal Likelihood

The polynomial coefficients in the kernel on graphs are found by maximizing the log-marginal likelihood on a training set using gradient optimization. Let , and let contain

and other hyperparameters, the GP log-marginal likelihood is

 l(Ω)=logP(~y|Ω)=−12log|ΣΩ|−12~y⊤Σ−1Ω~y−NM2log(2π). (11)

As described in Eq. (5), the term also acts as a filter on the GP prior to incorporate information from the graph structure. In order for to be a valid filter and a kernel on the graph, we need to constrain to be positive semi-definite (PSD); in other words, we need to have for all eigenvalues shuman2012emerging ; alvarez2012kernels . Just optimizing alone in an unconstrained fashion will not guarantee this, thus we utilize Lagrange multipliers to combine constraints with our main objective function.

Assuming all other hyperparameters are fixed, our constrained optimization problem for finding the optimal kernel on graphs is the following

 minβ−l(β)\textmdsubjectto−Bvβ≤0, (12)

where we express the log-marginal likelihood as a function of and is the Vandermonde matrix of eigenvalues of the Laplacian with the following form

 Bv=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝1λ1λ21…λP11λ2λ22…λP2⋮⋮⋮⋱⋮1λMλ2M…λPM⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠. (13)

It is easy to see that to have for all eigenvalues is equivalent to setting . Hence, our objective function now becomes

 L(β,L)=−l(β)+L⊤(−Bvβ)=−l(β)−L⊤Bvβ (14)

where is a vector of Lagrange multipliers. The solution to this problem is guided by the Karush–Kuhn–Tucker (KKT) conditions kuhn2014nonlinear , which specifies that is the optimal solution to (12) if is the solution to . Due to the non-convexity of the log-likelihood, we instead solve for the dual problem gasimov2002augmented ; bazaraa1979survey

 maxL≥0minβL(β,L). (15)

We find the solution by alternatively updating and described in Algorithm 1. Here, is replaced with and we solve for instead to keep the Lagrange multipliers positive during the optimization.

Due to the non-convexity of (15), Algorithm 1 may only find a local optimum depending on the initialization. A simple strategy to obtain a sensible initialization is to optimize for the log-marginal likelihood (without the constraint on ) using gradient ascent, with initializations chosen from a small set of values that lead to the highest log-marginal likelihood. The solution to this unconstrained optimization is then used as the initialization for Algorithm 1. The algorithm is much more stable with respect to the initialization of the Lagrange multipliers, and using either a fixed or random initialization work well in practice.

## 5 Related Work

Learning on graph-structured data has been studied from both machine learning and signal processing perspectives such as shuman2012emerging ; ortega2018graph ; bronstein2017geometric ; wu2019comprehensive . Our model is unique in that it makes use of tools from both fields to achieve interpretations of filtering and kernel learning in the graph spectral domain.

Laplacian-based functions in graph signal processing such as graph filters have been applied to data with certain smoothness assumptions, thus transforming data into one of low-, band- or high-pass profiles shuman2012emerging ; ortega2018graph . In contrast, our algorithm learns the filter based on the data to exempt the need to choose the filter profile a priori. This extends the non-probabilistic approach in thanou2013parametric with the added benefit of producing a measure of uncertainty. On the other hand, some learning frameworks in the graph spectral domain such as bruna2013spectral requires eigen-decomposition of the Laplacian which is of

, while our model avoids this problem as the kernel spectrum is specified without explicitly going into the graph spectral domain. More recent spectral graph neural networks

defferrard2016convolutional ; kipf2016semi ; wu2019comprehensive avoid this issue but typically require a large amount of training data; in comparison, our model is data-efficient and only requires a small number of training signals as shown in the experimental results.

Previous works relating to GP on graphs are mainly applied to scalar outputs. The way the graph is utilized follows a similar framework to graph neural network models such as kipf2016semi , with one representative approach being local averaging of neighbourhood information for node level classification ng2018bayesian . More complicated aggregation functions have since been applied as a linear function to the GP covariance in opolka2020graph ; chengdynamic ; liuuncertainty . Although these models may be extended to a vector output, they generally involve averaging or smoothing of the data, and the resulting effect is similar to a low-pass filter on the graph. Hence, these models are likely to perform less well on data that are not customarily smooth. Our model overcomes this issue through spectral learning of the kernel on graphs to adapt to the data more effectively. Finally, the convolutional patch technique in van2017convolutional has also been extended to graph data walker2019graph . This method can be viewed as an extension to the approach in ng2018bayesian , but it is still based on pre-defined kernel functions in the graph domain.

## 6 Experiments

In this section, we first present results on synthetic experiments to demonstrate our algorithm’s ability to recover ground truth filter shapes. We then apply our method to several real-world datasets that exhibit different spectral characteristics to show the adaptability of our model.

In all experiments, the GP prior will be in the form of Eq. (7) and we consider the following baseline models: global filtering venkitaraman2018gaussian , local averaging ng2018bayesian , graph Laplacian regularization (pseudo-inverse of the Laplacian) alvarez2012kernels , and the simplest which we refer to as standard GP. The input kernel will be squared exponential applied to inputs and , giving a total set of hyperparameters to be . All hyperparameters in the baselines are found by maximizing the log-marginal likelihoods by gradient ascent, with the exception that our model optimizes using Algorithm 1, with the initialization described at the end of Section 4 (details of the implementation is presented in Supplementary Material). The predictive performance will be evaluated by the posterior log-likelihoods for test signals , with GP posterior mean and covariance .

### 6.1 Synthetic Signals

For the first experiment we use synthetic signals which are generated following Eq. (5) using a with a known polynomial chosen beforehand. The aim is to demonstrate that our model can recover the polynomial shapes of the ground truth filters through optimizing the GP log-marginal likelihood.

We set the underlying graph to be a 30 nodes Sensor graph from the pygsp library defferrard2017pygsp . The Sensor graph has an even spread of eigenvalues which helps the visualization of the polynomial (further results on other random graph models are presented in Supplementary Material). Signals are first sampled independently as . Using the scaled graph Laplacian , we denote the ground truth filter as with coefficients . Each synthetic signal is then set as

and we corrupt it with noise at a signal-to-noise ratio (SNR) of 10 dB. As the signals are sampled independently, the kernel function is

where

is set to signal variance. We denote the polynomials learnt from our algorithm as

for degree which has coefficients. If the goes above 1 for any , we can scale it down as for . The resulting will be in the range making it easier to compare different filters, and the term can be absorbed into the variance of the full kernel function, alleviating the need to optimize for in .

In Fig. 1, we show the results from learning on synthetic data with low- and band-pass spectrum (a high-pass spectrum will simply have the reversed shape of the low-pass so we will not present here due to the similarity). In Fig. 0(a) and Fig. 0(d) we plot the graph Fourier coefficients of the generated signals, with each colour corresponding to one signal. The learnt polynomials with different degrees can be found in Fig. 0(b) and Fig. 0(e) along with the ground truth polynomial . Visually we can see that using a polynomial with and respectively capture the ground truths of low- and high-pass filters well enough that higher degree no longer offers clear improvement. This is also evident in the log-marginal likelihoods, where we see only little improvement for for low-pass and for band-pass spectrums.

We next study the effect of noise on learning the spectrum, using a degree 2 polynomial for low-pass and degree 3 for band-pass. Fig. 0(c) and 0(f) show the spectrum learnt for various SNRs, where we can see visually that our model recovers the true spectrum well for SNR 10 dB or higher. As expected, the corresponding marginals steadily decrease as SNR decreases when data becomes noisier.

### 6.2 Traffic Dataset

In the first real-world experiment we consider the daily traffic bottleneck stations in San Francisco choe2002freeway ; thanou2013parametric . Stations corresponding to nodes are connected in the graph if Euclidean distances are less than a threshold of 13 kilometers with inverse distance as edge weights. The signal is the average time (in minutes) that each bottleneck is active on a specific day. The graph consist of 75 nodes, where we use the data on the first 15 nodes as input , and predict on the remaining 60 nodes giving us

. We find the hyperparameters on the training set followed by conditioning to compute the posterior. The test signals are split 10-fold, with posterior log-likelihoods computed on each fold to provide an average test log-likelihood and standard error. The graph signals exhibit large Fourier coefficients for eigenvalues near 0 and 1, making the baseline models less suitable. The results are found in the first two columns of Table

1 for a training set of 10 and 20 signals, and the shapes of the kernel spectrum are presented in Fig. 1(a). We used up to degree 3 polynomials which were all able to outperform other models by up to 3 standard errors for both training sizes.

### 6.3 fMRI Dataset

We next consider data from functional magnetic resonance imaging (fMRI) where an existing graph of 4465 nodes corresponds to different voxels of the cerebellum region in the brain (we refer to venkitaraman2018gaussian ; behjat2016signal for more details on graph construction and signal extraction). A graph signal is the blood-oxygen-level-dependent (BOLD) signal observed on the voxels. We use the graph of the first 50 nodes, taking the readings on the first 10 nodes as to predict the outcome signals on the remaining 40 nodes (). The dataset contains 292 signals for which we train on a sample of up to 42 signals to learn the hyperparameters. We then compute the posterior to predict the remaining 250 signals, which are split 10-fold to provide a mean test log-likelihood and standard error. This dataset follows a low-pass spectrum that suits kernels in the baseline models, in particular global filtering venkitaraman2018gaussian , thus our algorithm is expected to produce similar performances. The results can be found in the middle two columns of Table 1 and the learnt polynomials in Fig. 1(b), we see the posterior log-likelihoods of our models for degree 2 and 3 is superior to other models but they do not offer significant improvement (the model from alvarez2012kernels cannot be applied due to singularity caused by disconnected graph).

### 6.4 Weather Dataset

The last dataset is the temperature measurement in 45 cities in Sweden available from the SMHI swedish . The data also follows a low-pass spectrum making the baselines from ng2018bayesian and venkitaraman2018gaussian suitable models. Using the cities’ longitude and latitude, we construct a -nearest neighbour graph for using pygsp defferrard2017pygsp . We perform the task of next-day prediction, where given as the temperature signal at day , we aim to predict as the temperature signal on day . We randomly sample 30 signal pairs for hyperparameter learning, and predict the signals on the remaining 60 days. The results are in the final two columns of Table 1 and filter shapes in Fig. 1(c), where we see degree 2 and 3 polynomials are performing significantly better than other models, while degree 1 performs relatively similar to the next closest in global filtering venkitaraman2018gaussian .

## 7 Conclusion

We have developed a novel GP-based method for graph-structured data to capture the inter-dependencies between observations on a graph. The kernel on graphs adapts to the characteristics of the data by using a bespoke learning algorithm that also provides a better interpretability of the model from a graph filtering perspective. Our model has produced superior performances on highly non-smooth data while results were competitive with the baselines on data that are generally smoother. Promising future directions include the extension of the model for application in classification and improvement in scalability of the model.

## References

•  Mauricio A Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
•  Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning theory and kernel machines, pages 144–158. Springer, 2003.
•  David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst.

The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains.

IEEE signal processing magazine, 30(3):83–98, 2013.
•  Antonio Ortega, Pascal Frossard, Jelena Kovačević, José MF Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828, 2018.
•  Siheng Chen, Aliaksei Sandryhaila, José MF Moura, and Jelena Kovacevic. Signal denoising on graphs via graph filtering. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 872–876. IEEE, 2014.
•  Arun Venkitaraman, Saikat Chatterjee, and Peter Händel. Gaussian processes over graphs. arXiv preprint arXiv:1803.05776, 2018.
•  Arun Venkitaraman, Saikat Chatterjee, and Peter Händel. Kernel regression for signals over graphs. arXiv preprint arXiv:1706.02191, 2017.
•  Yin Cheng Ng, Nicolò Colombo, and Ricardo Silva.

Bayesian semi-supervised learning with graph gaussian processes.

In Advances in Neural Information Processing Systems, pages 1683–1694, 2018.
•  Andrew Gordon Wilson and Ryan P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In ICML, 2013.
•  Yves-Laurent Kom Samo and Stephen Roberts. Generalized spectral kernels. arXiv preprint arXiv:1506.02236, 2015.
•  Jian Li, Yong Liu, and Weiping Wang. Automated spectral kernel learning. arXiv preprint arXiv:1909.04894, 2019.
•  Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006.
•  Lynn H Loomis. Introduction to abstract harmonic analysis. Courier Corporation, 2013.
•  Alex J Smola and Bernhard Schölkopf. From regularization operators to support vector kernels. In Advances in neural information processing systems, pages 343–349, 1998.
•  Alex J Smola, Zoltan L Ovari, and Robert C Williamson. Regularization with dot-product kernels. In Advances in neural information processing systems, pages 308–314, 2001.
•  Enrico Bozzo. The moore–penrose inverse of the normalized graph laplacian. Linear Algebra and its Applications, 439(10):3038–3043, 2013.
•  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
•  Harold W Kuhn and Albert W Tucker. Nonlinear programming. In Traces and emergence of nonlinear programming, pages 247–258. Springer, 2014.
•  Rafail N Gasimov. Augmented lagrangian duality and nondifferentiable optimization methods in nonconvex programming. Journal of Global Optimization, 24(2):187–203, 2002.
•  Mokhtar S Bazaraa and Jamie J Goode. A survey of various tactics for generating lagrangian multipliers in the context of lagrangian duality. European Journal of Operational Research, 3(4):322–338, 1979.
•  Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.

Geometric deep learning: going beyond euclidean data.

IEEE Signal Processing Magazine, 34(4):18–42, 2017.
•  Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
•  Dorina Thanou, David I Shuman, and Pascal Frossard. Parametric dictionary learning for graph signals. In 2013 IEEE Global Conference on Signal and Information Processing, pages 487–490. IEEE, 2013.
•  Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.
•  Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
•  Felix L Opolka and Pietro Liò. Graph convolutional gaussian processes for link prediction. arXiv preprint arXiv:2002.04337, 2020.
•  Pengyu Cheng, Yitong Li, Xin yuan Zhang, Liqun Cheng, David A. Carlson, and Lawrence Carin. Dynamic embedding on textual networks via a gaussian process. arXiv: Learning, 2019.
•  Zhaoyang Liu, ShaoYuan Li, Song can Chen, Yao Hu, and Sheng-Jun Huang. Uncertainty aware graph gaussian process for semi-supervised learning. In AAAI, 2020.
•  Mark Van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional gaussian processes. In Advances in Neural Information Processing Systems, pages 2849–2858, 2017.
•  Ian Walker and Ben Glocker. Graph convolutional gaussian processes. In ICML, 2019.
•  Michaël Defferrard, Lionel Martin, Rodrigo Pena, and Nathanaël Perraudin. Pygsp: Graph signal processing in python. https://github. com/epfl-lts2/pygsp, 2017.
•  Tom Choe, Alexander Skabardonis, and Pravin Varaiya. Freeway performance measurement system: operational analysis tool. Transportation research record, 1811(1):67–75, 2002.
•  Hamid Behjat, Ulrike Richter, Dimitri Van De Ville, and Leif Sörnmo. Signal-adapted tight frames on graphs. IEEE Transactions on Signal Processing, 64(22):6017–6029, 2016.
•  Swedish Meteorological and Hydrological Institute (SMHI). [Online]. Accessed: 2019-02-13.

## Appendix A Initialization Strategy

Due to the highly non-convex structure of the GP log-marginal likelihood, optimizing hyperparameters is heavily reliant on the initializations. Here, we propose a procedure of steps to get the best and most stable solution for Algorithm 1. We are aware that there may be more generalizable initialization strategies; what is presented here is the initialization we found that work well for our problem.

Based on a training set , the set of hyperparameters to learn is , from which we initialize

 l =\textmdMean({||y1||22,…,||yN||22}) (16) σ2w =\textmdVar({y1,…,yN}). (17)

We set the other parameters by trying a small range of values, using the combination that leads to the highest log-marginal likelihood as the initialization. Our procedure is as follows:

1. Find the optimal and that maximizes the log-marginal likelihood by a grid search.

2. Use the best combinations from grid search as initializations (along with initial and ) for the unconstrained problem - maximizing the log-marginal likelihood with respect to ( is indirectly optimized through as explained in Section 6.1).

3. Use the solution found in step 2 as the initializations to Algorithm 1 and solve for , while keeping all other hyperparameters constant.

In the grid search we use , while for each elements we use for low-pass, and for more complicated spectrums such as band-pass and that of the traffic data.

As a final note, we follow a general rule for selecting the learning rate for each hyperparameter (, etc) as: choosing the largest such that for hyperparameter , which leads to a consistent increase/decrease in the objective function. This will require some tuning from the user beforehand in order to ensure the algorithm converges in a reasonable time.

## Appendix B Synthetic Experiments using Barabási–Albert Random Graph

In order to show that our algorithm can generalize to different graphs, we also tried recovering graph filters using on a Barabási–Albert (BA) random graph in place of the Sensor graph described in Section 6.1. The graph contains 30 nodes generated with an initial 10 nodes, and each node added will be randomly connected to 5 existing nodes. Signals are generated using the same form as in the Sensor graph example, as well as using the same full GP kernel function. In Fig. 2(a) and 2(c) we can visually see the distribution of the eigenvalues are less uniform, while Fig. 2(b) and 2(d) show the low- and band-pass filter shapes are still recovered relatively well using Algorithm 1.

## Appendix C Synthetic Experiments with Pre-Defined Kernel on Input Space

In our synthetic experiments in Section 6.1, we use due to the signals being sampled independently. This allows us to see the full effect of the kernel on the output space, but cannot be used for inference as predictions from the posterior will be the same as the prior. We next present a more comprehensive synthetic generative model that defines the covariance between signals. This will also act as a sanity check for our algorithm’s ability to recover graph filters when paired with different input kernels.

We start by sampling a positive definite matrix from inverse Wishart distribution with identity hyperparameter, where is of size . We then draw samples from to create our data matrix of size . Each row in is of dimension and so can be interpreted as a signal on the graph. Next, let denote the th row of , we filter this signal by

 yi=θ(LS)ri. (18)

The signals and of will have covariance , and elements of each will be determined by the filter . Hence can be used as the covariance matrix on the input space, and the full kernel of the GP becomes

 C⊗BB⊤+σ2ϵI. (19)

We present the recovered filter shapes in Fig. 4 for the two previously used ground truth filter shapes on a Sensor graph. The results are similar as degree 2 and 3 polynomials are able to respectively recover low- and band-pass filters well, both visually and in terms of log-marginal likelihood.

## Appendix D Additional Information for Real-World Experiments

We present in Fig. 4(a) to 4(c) the graphs and the graph Fourier spectrum of the data in the three real-world experiments. In particular, for the traffic and fMRI data the graph is split into yellow and purple nodes, where the data observed on these nodes will correspond to inputs and outputs respectively for the GP. The signals we work with is on a graph that only consists of the purple nodes; all connections related to the input nodes are thereby ignored. On the weather dataset we carry out next-day prediction so each signal is on the full graph. The graph Fourier spectrums of the data are presented in Fig. 4(d) to 4(f), where the corresponding filters learnt are presented in the main text and can be found in Fig. 2.