Kernel-based Graph Learning from Smooth Signals: A Functional Viewpoint

Xingyue Pu et al., University of Oxford, 23 August 2020

The problem of graph learning concerns the construction of an explicit topological structure revealing the relationship between nodes representing data entities, which plays an increasingly important role in the success of many graph-based representations and algorithms in the field of machine learning and graph signal processing. In this paper, we propose a novel graph learning framework that incorporates the node-side and observation-side information, and in particular the covariates that help to explain the dependency structures in graph signals. To this end, we consider graph signals as functions in the reproducing kernel Hilbert space associated with a Kronecker product kernel, and integrate functional learning with smoothness-promoting graph learning to learn a graph representing the relationship between nodes. The functional learning increases the robustness of graph learning against missing and incomplete information in the graph signals. In addition, we develop a novel graph-based regularisation method which, when combined with the Kronecker product kernel, enables our model to capture both the dependency explained by the graph and the dependency due to graph signals observed under different but related circumstances, e.g. different points in time. The latter means the graph signals are free from the i.i.d. assumptions required by the classical graph learning models. Experiments on both synthetic and real-world data show that our methods outperform the state-of-the-art models in learning a meaningful graph topology from graph signals, in particular under heavy noise, missing values, and multiple dependency.

I Introduction

Fig. 1: A functional viewpoint of graph learning: (a) The observation matrix with missing and noisy entries; (b) Each row of the observation matrix is modelled as samples obtained at a collection of fixed locations (considered as nodes in a graph) from an underlying function (top); Values at each node are determined by the underlying function, as well as node-side information and observation-side information (bottom); (c) The learned graph topology.

Modelling based on graphs has recently attracted an increasing amount of interest in machine learning and signal processing research. On the one hand, many real-world data are intrinsically graph-structured, e.g. individual preferences in social networks or environmental monitoring data from sensor networks. This makes graph-based methods a natural approach to analysing such structured data. On the other hand, graphs are an effective modelling language for revealing relational structure in complex domains and may assist in a variety of learning tasks. For example, knowledge graphs improve the performance in semantic parsing and question answering [1]. Despite their usefulness, however, a graph is not always readily available or explicitly given. The problem of graph learning therefore concerns the construction of a topological structure among entities from a set of observations on these entities.

Methodologies to learn a graph from structured data include naïve methods such as k-nearest neighbours (k-NN), as well as approaches from the literature on probabilistic graphical models (PGMs) and, more recently, graph signal processing (GSP) and graph neural networks (GNNs). The basic idea of k-NN is to connect a node to the k other nodes with the smallest pairwise distances in terms of the observations [2, 3, 4, 5]. In PGMs, a graph expresses, via its edges, the conditional dependence between the random variables represented by the nodes [6]. The GSP literature, on the other hand, focuses on algebraic and spectral characteristics of graph signals [7, 8, 9], which are defined as observations on a collection of nodes. GSP-based graph learning methods (see [10, 11] for two recent reviews) further fall into two distinct branches, namely those based on diffusion processes on graphs [12, 13, 14, 15] and those based on smoothness measures of graph signals [16, 17, 18, 19, 20]. Very recently, GNNs have attracted surging interest in the machine learning community, leading to a number of approaches to graph inference [21, 22].

While many of the above methods can effectively learn a meaningful graph from observations, they generally do not consider additional information, i.e. node-side or observation-side covariates, which may be available for the task at hand. Such covariates provide valuable side information and should be integrated into the graph learning framework. Take the example of temperature records measured at different locations in a country, where nodes represent weather stations: the latitude, longitude and altitude of each station are useful node-side information. One major benefit is to lessen the reliance of the above models on the quality of the observations. Heavily corrupted or even missing records can be predicted from the relationship between the observations and the side information, which in turn helps improve the efficiency of graph inference.

Furthermore, although node-side dependency is inherently accounted for in the process of graph learning, observation-side dependency is largely ignored in the literature. One example is temperature records collected at different timestamps, where the temporal dependency could largely affect the evaluation of the strength of the relation between stations. Another example is that of a recommender system, where the item ratings collected from different individuals are largely affected by the social relationships between them.

To tackle the above issues, we revisit graph signal observations from a functional viewpoint and propose a framework for learning undirected graphs by considering additional covariates on both the node and observation sides. This allows us to capture the dependency structure within the graph signals, which leads to more effective graph and signal recovery. More specifically, as shown in Figure 1, each entry of the graph-structured data matrix, which contains graph signals collected on a set of nodes, can be viewed as a potentially noisy or missing observation of an underlying function, namely the function associated with that observation evaluated at that node. To model the node-side information, we introduce a covariate that can explain the variations in a graph signal, e.g. a vector that contains the latitude, longitude and altitude of a station in the aforementioned temperature example. To model the observation-side information, we also introduce a generic covariate, which could, for example, be the timestamp at which a temperature record is collected. Observation-side dependency hence arises because the function value depends on this covariate. Combining the two, the function underlying the graph signals takes the node-side and observation-side covariates as its two arguments.

Specifically, we define the function in a reproducing kernel Hilbert space (RKHS) associated with a product kernel over the node-side and observation-side covariate domains. At the same time, the two-side dependency in the graph signals is encoded in a Kronecker product of two graph Laplacian matrices, where one Laplacian represents the connectivity between nodes to be learned, and the other represents the observation-side dependency, which is essentially a nuisance dependency for the graph learning problem. We assume that the latter can be captured by a kernel evaluated at the observation-side covariates. Our key contribution is the Kernel Graph Learning (KGL) framework, which allows us to infer the node graph by jointly learning the function and optimising a novel Laplacian quadratic form that effectively expresses the smoothness of the fitted signals over the learned graph.

In addition, we provide several extensions of KGL for the scenario of a partially observed data matrix with known missing-value positions, and for observations without either node-side or observation-side information. The learning problem is solved efficiently via a block coordinate descent algorithm with a theoretical guarantee of convergence. We show that KGL can effectively recover the groundtruth graph from two-side dependent data and outperforms state-of-the-art smoothness-based graph learning methods in both synthetic and real-world experiments.

In summary, the main contributions of our work are as follows:

  • A novel graph-based regularisation based on a smoothness measure of dependent graph signals over the Kronecker product of two graph Laplacian matrices;

  • A graph learning framework that integrates node- and observation-side covariates from a functional viewpoint;

  • An efficient method for denoising and imputing missing values in the observed graph signals as a byproduct of the graph learning framework.

II Related Work

In this section, we survey the classical methods of learning a graph from a number of different perspectives. From each perspective, we highlight the most related work that considers one or more aspects of the 1) node-side information, 2) observation-side dependency, and 3) noisy and missing data.

II-A k-Nearest Neighbours Methods

The k-nearest neighbours (k-NN) method connects a node to the k other nodes with the smallest pairwise distances in terms of the observations. It is flexible with respect to the choice of distance metric, yet heuristic, since the neighbourhood search is based on pairwise comparison of the observations on the nodes. The majority of k-NN variants focus on fast approximate nearest-neighbour (ANN) search algorithms [4, 2, 3, 5], and recent variants apply deep reinforcement learning to explicitly maximise search efficiency [23, 24]. By comparison, the model-based methods, e.g. PGMs and GSP, directly integrate global properties of the observations into their learning objectives.
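For illustration, the following Python sketch implements the basic k-NN construction described above; the RBF-style edge weighting and the bandwidth heuristic are illustrative choices rather than prescriptions from the surveyed works.

```python
import numpy as np

def knn_graph(Y, k=5):
    """Connect each node to its k nearest neighbours in observation space.

    Y is an (M, N) matrix of M observations on N nodes; the columns (nodes)
    are compared pairwise and edges are weighted with a Gaussian of the
    distance (an illustrative choice).
    """
    N = Y.shape[1]
    # pairwise Euclidean distances between columns (nodes)
    D = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)
    sigma = np.median(D[D > 0])              # simple bandwidth heuristic
    W = np.zeros((N, N))
    for i in range(N):
        idx = np.argsort(D[i])[1:k + 1]      # k smallest distances, skipping the node itself
        W[i, idx] = np.exp(-D[i, idx] ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)                # symmetrise: keep an edge if either endpoint selects it
```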

II-B Probabilistic Graphical Models

In the field of PGMs, the inverse covariance (precision) matrix is often regarded as an undirected graph that parameterises the joint distribution of the random variables representing the nodes. There is a rich literature on effective algorithms to estimate a sparse precision matrix from Gaussian-distributed data by solving an ℓ1-regularised log-likelihood maximisation [25, 26, 27, 28, 29], including the widely used graphical Lasso [30] and G-ISTA [31]. The recent state-of-the-art algorithm BigQUIC [32] scales to millions of nodes with performance guarantees. Besides computational improvements, models based on attractive Gaussian Markov random fields (GMRFs) [33, 34, 35, 36] further restrict the off-diagonal entries of the precision matrix to be non-positive, which is equivalent to learning the Laplacian matrix of the corresponding graph with non-negative edge weights. The most closely related extensions of the graphical Lasso were proposed in [37, 38], which simultaneously learn two dependency structures in matrix-variate Gaussian data. While their work focuses on estimating covariance matrices, our work focuses on recovering a graph topology from data.

II-C Structural Equation Models

Structural equation models (SEMs) are another type of model (similar to PGMs) widely used to learn a directed acyclic graph (DAG) that encodes the conditional dependence of random variables [39, 40, 41]. Based on SEMs, the authors in [42] proposed a block coordinate descent algorithm to solve the joint optimisation problem of denoising the data and learning a directed graph. The joint learning framework is further extended to time series [43], where structural vector autoregressive models (SVARMs) replace the linear SEMs to handle temporal dependency. The main difference from our work is that their denoising function is an identity mapping without side information as covariates. The work in [41] also considers temporal dependency in learning a DAG with SVARMs, but does not consider the denoising scenario.

II-D Graph Signal Processing

In the context of GSP, every observation on a collection of nodes is defined as a graph signal. GSP-based graph learning models have seen increasing interest in the literature [10, 11] and further fall into two distinct branches. The first branch assumes graph signals are outcomes of diffusion processes on graphs and reconstructs a graph from signals according to the diffusion model [12, 13, 14, 15]. The other branch constructs a graph by promoting the global smoothness of graph signals, which is defined via the Laplacian quadratic form [16, 17] or, more generally, via total variation [20]. Smoothness-based methods are related to GMRFs in that the Laplacian quadratic form is closely related to the Gaussian log-likelihood when the precision matrix is defined as the graph Laplacian. Our work can be regarded as an extension of smoothness-based graph construction.

In the literature on smoothness-based GSP graph learning, the authors in [17, 20] adopt a two-step learning algorithm to learn an undirected graph while denoising the graph signals. They simply assume an identity mapping between the actual graph signals and the noisy observations, which differs from our work in that it does not consider side information. The most closely related work is proposed in [44], which uses kernel ridge regression with observation-side covariates to infer graph signals. However, their work mainly focuses on data prediction, and graph learning is only a byproduct of their approach. In Section VI-A, we will show, both theoretically and empirically, that their method uses a smoothness term that imprecisely incorporates the observation-side dependency in the learned graph structure, leading to inferior performance in learning a graph.

In terms of observation-side dependency, there exist some GSP graph learning models that consider temporal dependency in graph signals. A so-called spatiotemporal smoothness was proposed in [45, 46] to transform the graph signals using a temporally weighted difference operator. If every timestamp is equally important, the operator is equivalent to a preprocessing step that makes the time series observed on each node stationary. It should be noted that there is another branch of research assuming that the temporal dependency in graph signals originates from dynamic changes in the edges [47, 48]; the problem is therefore formulated as learning a dynamic series of graphs, which is different from the goal of our paper.

II-E Graph Neural Networks

A new branch of graph learning models has been developed from the perspective of GNNs. Essentially, GNNs discover the patterns in graph-structured data in a hierarchical manner [49, 50, 51]. The activations at an intermediate layer can be interpreted as a new representation of the nodes in an embedding space, incorporating information from a specifically defined neighbourhood of each node. The authors in [52, 21] thus defined the strength of connectivity between a pair of nodes based on the pairwise similarity of their embeddings at a given layer of the GNN architecture. The authors in [53] extended this method to construct a directed graph in the process of training a GNN that deals with time series data. The main goal of these methods is to improve the performance of node-related tasks (e.g. classification or prediction), and graph learning is only a byproduct whose performance is often not guaranteed. Recent works [54, 55, 56, 22] have started to incorporate an additional loss for recovering graphs while training the GNNs. However, a significant limitation of most GNN-based methods is that they typically require a large volume of training data, and the learned connectivity is often less explainable compared with PGM and GSP methods.

III Preliminaries

III-A Smoothness-Based GSP Graph Learning

Consider a data matrix Y of size M x N, whose (m, i)-th entry y_{mi} corresponds to the observation on the i-th node in the m-th graph signal. We are interested in constructing an undirected and weighted graph G = (V, E, W). The node set V represents a collection of variables, where |V| = N. The edge set E represents the relational structure among them to be inferred. This structure is completely characterised by the weighted adjacency matrix W whose (i, j)-th entry is W_{ij}. If two nodes i and j are connected by an edge then W_{ij} > 0, and W_{ij} = 0 otherwise. The graph Laplacian matrix is defined as L = diag(W1) - W, where 1 denotes the all-one vector. W and L are equivalent and complete representations of the graph on a given set of nodes.

In the GSP literature, one typical approach to constructing a graph from Y is formulated as minimising the variation of the signals on the graph, as measured by the Laplacian quadratic form¹ [7, 16, 17]:

(1)   \min_{L \in \mathcal{L}} \ \operatorname{tr}\!\left( Y L Y^{\top} \right) + \beta\, h(L),

¹We acknowledge that the conventional form of the Laplacian quadratic in the GSP literature is \operatorname{tr}(X^{\top} L X), where each column of X corresponds to a graph signal. In our case, Y has two-side dependency, such that either a column or a row may be regarded as a graph signal. The term \operatorname{tr}(Y L Y^{\top}) measures the smoothness of the row vectors over a column graph. This formulation is, however, consistent with the statistical modelling convention where each column of Y is regarded as a random variable and the graph of main interest is the column graph.

where \mathcal{L} denotes the space of valid Laplacian matrices and h(L) is a regularisation term with a hyperparameter \beta. Equivalently, the problem can be formulated using the weighted adjacency matrix such that

(2)   \min_{W \in \mathcal{W}} \ \frac{1}{2} \sum_{i,j} W_{ij} \left\| y_{\cdot i} - y_{\cdot j} \right\|_2^2 + \beta\, h(W),

where \mathcal{W} denotes the space of valid weighted adjacency matrices and y_{\cdot i} denotes the i-th column of Y. Popular choices of regularisation include the sum barrier on the edge weights together with a Frobenius-norm term (the former often added as a constraint, e.g. \operatorname{tr}(L) = N) to prevent the trivial solution where all edge weights are zero while controlling the variation of the edge weights [17], or the log-barrier on the node degrees to prevent isolated nodes and promote connectivity [16].

With a fixed Frobenius norm for Y, a small value of the objective in Eq.(1) implies that Y is smooth on the graph, in the sense that neighbouring nodes have similar observations. The authors in [17] further propose a probabilistic generative model of the noise-free smooth observations such that

(3)   y \sim \mathcal{N}\!\left( \mathbf{0},\ L^{\dagger} \right),

where y denotes a noise-free graph signal (a row of Y) and L^{\dagger} denotes the Moore-Penrose pseudo-inverse of a matrix. This leads to a graph learning framework which solves an optimisation problem similar to Eq.(1).
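As a concrete illustration of these definitions, the following minimal sketch builds the Laplacian from a weighted adjacency matrix and evaluates the quadratic form of Eq.(1) under the row-signal convention used here; it also checks the equivalent pairwise-difference form of Eq.(2). Variable names are illustrative.

```python
import numpy as np

def laplacian(W):
    """Combinatorial graph Laplacian L = diag(W 1) - W."""
    return np.diag(W.sum(axis=1)) - W

def smoothness(Y, W):
    """Laplacian quadratic form tr(Y L Y^T), with rows of Y as graph signals
    and columns of Y indexing the nodes of the graph defined by W."""
    return np.trace(Y @ laplacian(W) @ Y.T)

# tiny check of the pairwise-difference identity used in Eq.(2)
rng = np.random.default_rng(0)
W = rng.uniform(size=(5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
Y = rng.normal(size=(3, 5))
pairwise = 0.5 * sum(W[i, j] * np.sum((Y[:, i] - Y[:, j]) ** 2)
                     for i in range(5) for j in range(5))
assert np.isclose(smoothness(Y, W), pairwise)
```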

III-B Kronecker Product Kernel Regression

Taking a functional viewpoint on the generation of the graph-structured data matrix, we can make use of the well-studied formalism of Kronecker product kernel ridge regression to infer the latent function f [57]. Specifically, we consider f to be an element of a reproducing kernel Hilbert space (RKHS) corresponding to the product kernel k_T ⊗ k on the joint domain of observation-side and node-side covariates, where ⊗ denotes the Kronecker (tensor) product, k_T the observation-side kernel and k the node-side kernel.

A kernel function can be expressed as an inner product in a corresponding feature space, i.e. k(x, x') = ⟨φ(x), φ(x')⟩ for the node-side kernel and k_T(t, t') = ⟨ψ(t), ψ(t')⟩ for the observation-side kernel, where φ and ψ denote the respective feature maps. An explicit representation of the feature maps is not necessary, and the dimension of the mapped feature vectors can be large or even infinite. By the representer theorem, the function that fits the data takes the form

(4)   \hat{f}(t, x) = \sum_{m=1}^{M} \sum_{i=1}^{N} a_{mi}\, k_T(t, t_m)\, k(x, x_i),

where the a_{mi} are the coefficients to be learned, x_i denotes the node-side covariate of node i, t_m denotes the observation-side covariate of the m-th observation, and the estimated value at node i in the m-th observation is \hat{y}_{mi} = \hat{f}(t_m, x_i). Denoting the corresponding kernel matrices as K \in \mathbb{R}^{N \times N} and K_T \in \mathbb{R}^{M \times M}, where K_{ij} = k(x_i, x_j) and (K_T)_{mn} = k_T(t_m, t_n), we have the matrix form

(5)   \hat{Y} = K_T A K,

where \hat{Y} is an approximation to Y and the coefficient matrix A \in \mathbb{R}^{M \times N} has a_{mi} as its (m, i)-th entry. We assume each observation y_{mi} is a noisy version of f(t_m, x_i), where the noise terms are i.i.d. normally distributed, i.e. y_{mi} = f(t_m, x_i) + \epsilon_{mi} with \epsilon_{mi} \sim \mathcal{N}(0, \sigma^2), and \sigma^2 measures the noise level. This leads to a natural choice of the sum of squared errors as the loss function.

A standard Tikhonov regulariser is often added to the regression model to reduce overfitting and penalise complex functions, which in our case is the squared RKHS norm of the function,

(6)   \|\hat{f}\|_{\mathcal{H}}^2 = \operatorname{vec}(A)^{\top} \left( K \otimes K_T \right) \operatorname{vec}(A),

where vec(·) is the vectorisation operator that stacks the columns of a matrix. We now arrive at the following optimisation problem to infer the function that approximates the observation matrix:

(7)   \min_{A} \ \left\| Y - K_T A K \right\|_F^2 + \mu\, \operatorname{vec}(A)^{\top} \left( K \otimes K_T \right) \operatorname{vec}(A),

where the hyperparameter \mu controls the penalisation of the complexity of the function to be learned.
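For concreteness, the following sketch solves a problem of the form of Eq.(7) in the standard way via eigendecompositions of the two kernel matrices, avoiding the explicit MN x MN Kronecker matrix. The RBF kernels and toy covariates in the usage example are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def kron_krr_fit(Y, K, K_T, mu):
    """Kronecker-product kernel ridge regression (a sketch of Eq.(7)).

    Y   : (M, N) observations (rows: observation side, columns: nodes)
    K   : (N, N) node-side kernel matrix
    K_T : (M, M) observation-side kernel matrix
    mu  : Tikhonov regularisation weight

    Computes vec(A) = (K kron K_T + mu I)^{-1} vec(Y) without forming the
    MN x MN matrix, using eigendecompositions of K and K_T.
    """
    lam, U = np.linalg.eigh(K)                # K   = U diag(lam) U^T
    gam, V = np.linalg.eigh(K_T)              # K_T = V diag(gam) V^T
    S = 1.0 / (np.outer(gam, lam) + mu)       # pairwise eigenvalue products, shifted by mu
    A = V @ (S * (V.T @ Y @ U)) @ U.T         # coefficient matrix
    return A, K_T @ A @ K                     # coefficients and fitted values (Eq.(5))

# usage with RBF kernels on toy covariates (illustrative only)
rng = np.random.default_rng(1)
x = rng.normal(size=(20, 3))                  # node-side covariates, e.g. coordinates
t = np.arange(10.0)[:, None]                  # observation-side covariates, e.g. time-stamps
rbf = lambda Z, s: np.exp(-np.square(Z[:, None] - Z[None, :]).sum(-1) / (2 * s ** 2))
K, K_T = rbf(x, 1.0), rbf(t, 2.0)
Y = rng.normal(size=(10, 20))
A, Y_hat = kron_krr_fit(Y, K, K_T, mu=0.1)
```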

To better understand how this model captures the two-side dependency, we show that the objective in Eq.(7) can be derived from a Bayesian viewpoint. In vector form, writing y = vec(Y) and f = vec(F) for the noise-free signal matrix F, we assume that both the data likelihood and the prior follow a Gaussian distribution:

(8a)   y \mid f \sim \mathcal{N}\!\left( f,\ \sigma^2 I_{MN} \right),
(8b)   f \sim \mathcal{N}\!\left( \mathbf{0}_{MN},\ K \otimes K_T \right),

where 0_{MN} is a zero vector of length MN. Notice that K and K_T (and their Kronecker product) can be either singular or non-singular matrices, depending on the kernel choice. For the sake of simplicity, we use the pseudo-inverse notation throughout the paper. Now, the marginal likelihood of y is

(9)   y \sim \mathcal{N}\!\left( \mathbf{0}_{MN},\ K \otimes K_T + \sigma^2 I_{MN} \right),

from which we can see that the covariance structure of Y can be understood as a combination of K and K_T. Specifically, in the noise-free scenario where \sigma^2 = 0, the covariance matrix over the rows of F is

(10)   \mathbb{E}\!\left[ F F^{\top} \right] = \operatorname{tr}(K)\, K_T.

Similarly, the covariance matrix over the columns of F is

(11)   \mathbb{E}\!\left[ F^{\top} F \right] = \operatorname{tr}(K_T)\, K.

The proof of Eq.(10) and Eq.(11) can be found in [58]. In Appendix A, we show that maximising the log-posterior of the coefficient vector vec(A) leads to the objective in Eq.(7).
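The matrix-normal reading of Eq.(8b) also gives a simple way to draw data with this two-side covariance structure. The sketch below assumes the column-stacking vec convention, so that cov(vec(F)) = K ⊗ K_T corresponds to a matrix normal with row covariance K_T and column covariance K; the function name is illustrative.

```python
import numpy as np

def sample_two_side(K, K_T, sigma, rng):
    """Draw Y with cov(vec(Y)) = K kron K_T + sigma^2 I, as in Eq.(9).

    Uses the matrix-normal identity: if Z has i.i.d. N(0, 1) entries, then
    F = K_T^{1/2} Z K^{1/2} satisfies cov(vec(F)) = K kron K_T under the
    column-stacking vec convention.
    """
    def sqrtm_psd(S):
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T
    M, N = K_T.shape[0], K.shape[0]
    F = sqrtm_psd(K_T) @ rng.normal(size=(M, N)) @ sqrtm_psd(K)
    return F + sigma * rng.normal(size=(M, N))
```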

IV A Smoothness Measure for Dependent Graph Signals

From Eq.(9), a noise-free version of Y has the Kronecker product covariance structure K ⊗ K_T. Recall that, for Gaussian graph signals, the covariance is often modelled as the pseudo-inverse of the graph Laplacian matrix (see Eq.(3) and [17, 33]), which is used for measuring signal smoothness. Inspired by this observation, we define a notion of smoothness for graph signals with two-side dependency using the Laplacian quadratic form

(12)   Q(Y) = \operatorname{vec}(Y)^{\top} \left( K^{\dagger} \otimes K_T^{\dagger} \right) \operatorname{vec}(Y),

where K^{\dagger} ⊗ K_T^{\dagger} can be interpreted as a Laplacian-like operator with a Kronecker product structure. To see this more clearly, first, let us define the undirected weighted column graph that represents the structure among the column vectors of Y, with Laplacian L, and correspondingly the row graph that represents the structure among the row vectors, with Laplacian L_T. Second, let us connect the kernel matrices K and K_T to the Laplacian matrices L and L_T by recognising that the former can be defined as functions of the latter, i.e. as kernels on graphs [59], e.g. the regularised Laplacian or diffusion kernels. Therefore, K^{\dagger} ⊗ K_T^{\dagger} is itself a function of the Kronecker product of the two Laplacians, and we further show in Appendix B that it is a Laplacian-like operator on which a notion of frequencies of vec(Y) can be defined.

In practice, the observation-side dependency is often given or easy to obtain. For example, for graph signals with temporal Markovian dependency, the row graph is often modelled as a path graph, representing that the observation at a given time only depends on the observation at the previous time. By comparison, the column-graph Laplacian L is the primary variable of interest that is typically estimated in the graph learning literature. Therefore, in this paper, we assume that the observation-side dependency can be encoded in the observation-side information via the kernel matrix K_T. We simply write L for the column-graph Laplacian from this section onwards, and define a smoothness measure in which the node-side kernel matrix in Eq.(12) is replaced with this Laplacian matrix:

(13)   Q(Y) = \operatorname{vec}(Y)^{\top} \left( L \otimes K_T^{\dagger} \right) \operatorname{vec}(Y) = \operatorname{tr}\!\left( Y^{\top} K_T^{\dagger} Y L \right).

This effectively disentangles the relationship among the nodes (i.e. the graph to be learned) from the observation-side dependency in the graph signals.

This smoothness term, in its vectorised form, can be viewed as a Laplacian regulariser which can be added to the problem of inferring a function that fits the graph signals in Eq.(7). Specifically, the graph signals in the smoothness term can be replaced with the estimates from the function, such that

(14)   Q(\hat{Y}) = \operatorname{vec}(\hat{Y})^{\top} \left( L \otimes K_T^{\dagger} \right) \operatorname{vec}(\hat{Y}), \qquad \hat{Y} = K_T A K,

which can also be related to manifold regularisation, where the graph signals are viewed as samples of a function on a compact manifold². We will make use of the Laplacian regulariser in Eq.(14) to derive the proposed graph learning models in the following section.

²We refer the interested reader to [60, 61] for the theory of manifold regularisation with the Laplace-Beltrami operator.
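A minimal numerical sketch of this smoothness measure follows, assuming the quadratic form as written in Eq.(13); the trace expression in the code is the standard rewriting of the Kronecker-structured quadratic form.

```python
import numpy as np

def dependent_smoothness(Y, L, K_T):
    """Smoothness of two-side dependent graph signals (sketch of Eq.(13)).

    Computes vec(Y)^T (L kron pinv(K_T)) vec(Y) = tr(Y^T pinv(K_T) Y L),
    i.e. the Laplacian quadratic form with the observation-side dependency
    whitened by the pseudo-inverse of K_T.
    """
    return np.trace(Y.T @ np.linalg.pinv(K_T) @ Y @ L)

# with K_T = I this reduces to the classical smoothness tr(Y L Y^T) of Section III-A
```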

V Kernel Graph Learning

V-A Learning Framework

We propose a joint learning framework for inferring the function that fits the graph signals, as in Eq.(7), as well as the underlying graph that captures the relationship between the nodes, as in Eq.(14). This relationship is disentangled from the observation-side dependency of non-i.i.d. graph signals through the notion of smoothness introduced in Section IV. We name this framework Kernel Graph Learning (KGL), which aims at solving the following problem:

(15)   \min_{A,\, L} \ \left\| Y - K_T A K \right\|_F^2 + \mu\, \operatorname{vec}(A)^{\top} (K \otimes K_T)\operatorname{vec}(A) + \gamma\, \operatorname{tr}\!\left( \hat{Y}^{\top} K_T^{\dagger} \hat{Y} L \right) + \beta \left\| L \right\|_F^2
\quad \text{s.t.} \quad \hat{Y} = K_T A K, \ \ \operatorname{tr}(L) = N, \ \ L \in \mathcal{L},

where \mathcal{L} = \{ L : L = L^{\top},\ L_{ij} \le 0 \ (i \ne j),\ L \mathbf{1} = \mathbf{0} \} denotes the set of valid Laplacian matrices and \|\cdot\|_F denotes the Frobenius norm. The first two terms correspond to the functional learning part, where the hyperparameter \mu controls the complexity of the function for fitting Y. The last two terms and the constraints can be viewed as the graph learning model in Eq.(1), with the fitted values \hat{Y} as input graph signals and the sum barrier as the graph regulariser. The hyperparameter \gamma controls the relative importance of graph learning compared with fitting the function, and \beta controls the distribution of edge weights. The trace constraint acts as a normalisation such that the sum of the learned edge weights equals the number of nodes. The model is compatible with constraints that enforce other properties on the learned graph, e.g. the log-barrier introduced in Section III-A. This paper is mainly based on this particular choice of constraints in order to maintain focus on the general framework.

V-B Optimisation: Alternating Minimisation

We first recognise that Eq.(15) is a biconvex optimisation problem, i.e. convex w.r.t. A while L is fixed, and vice versa. This motivates an iterative block coordinate descent algorithm that alternates between minimisation in A and in L [62, 63]. In this section, we derive the update steps of A and L separately, present the main algorithm in Algorithm 1, and prove its convergence.

V-B1 Update of the Coefficient Matrix

The update of the coefficient matrix A can be regarded as solving a Laplacian-regularised kernel regression [61]. Given L, the optimisation problem of Eq.(15) becomes

(16)

and, after dropping constant terms,

(17)

Denoting the vectorised coefficients and observations by a = vec(A) and y = vec(Y), we obtain a dual form of the problem such that

(18)

where the coefficient matrices of the quadratic and linear terms are written compactly for simplicity. We prove in Appendix C that the matrix of the quadratic term is positive semi-definite, and thus Eq.(18) is an unconstrained quadratic programme. The gradient of the objective w.r.t. a is

(19)

Strictly speaking, the kernel matrices may contain zero eigenvalues, which makes the matrix appearing in the gradient not invertible. However, the majority of popular kernel functions for the node-side and observation-side covariates are positive definite, e.g. the RBF kernel. Since the Kronecker product preserves positive definiteness, K ⊗ K_T, and hence the whole matrix, is positive definite and invertible. Setting the gradient to zero and cancelling out the common invertible factor, we have:

(20)

Collecting terms, the stationarity condition involves a single system matrix of dimension MN x MN, and we can therefore obtain a closed-form solution such that

(21)

where computing the inverse directly requires O(M^3 N^3) operations.

To further reduce this complexity, we make use of the Kronecker structure and standard matrix identities. We first recognise that the system matrix can be factorised into a Kronecker sum of an N x N matrix and an M x M matrix, multiplied by a matrix with a Kronecker product structure,

where ⊕ denotes the Kronecker sum. With the eigendecompositions of the two small matrices, we have

(22)

Here, the middle factor is an MN x MN diagonal matrix whose entries are all the pairwise sums of the eigenvalues of the two small matrices, so inverting it costs only O(MN). We can now obtain a cheap inversion with

(23)

This operation simply rescales each entry of the MN-vector by the corresponding diagonal entry. If we collect the two sets of eigenvalues into column vectors, the rescaling can be expressed as a Hadamard product with the entrywise inverse of the matrix of all pairwise eigenvalue sums,

where ⊙ denotes the Hadamard product. The remaining operations are direct matrix multiplications and give the closed-form solution

(24)

Notice that this solution requires only matrix multiplications, inversions and eigendecompositions of N x N or M x M matrices, giving an overall computational cost that is cubic in max(M, N).
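The following self-contained sketch illustrates the Kronecker-sum inversion trick in its generic form: solving a system whose matrix is B ⊕ C (equivalently, a Sylvester equation) using only the two small eigendecompositions. The specific matrices entering the paper's update are those derived around Eq.(22)-(24); B and C below are generic placeholders.

```python
import numpy as np

def kron_sum_solve(B, C, R):
    """Solve (B kron-sum C) vec(X) = vec(R), i.e. C X + X B^T = R, using only
    small eigendecompositions.

    B: (N, N) symmetric, C: (M, M) symmetric, R: (M, N). The MN x MN inverse
    reduces to an entrywise division by all pairwise sums of eigenvalues of
    B and C, which is the structure exploited in Eq.(22)-(24).
    """
    lam, U = np.linalg.eigh(B)
    gam, V = np.linalg.eigh(C)
    return V @ ((V.T @ R @ U) / (gam[:, None] + lam[None, :])) @ U.T

# quick self-check on random symmetric positive-definite inputs
rng = np.random.default_rng(2)
N, M = 6, 4
B = rng.normal(size=(N, N)); B = B @ B.T + N * np.eye(N)
C = rng.normal(size=(M, M)); C = C @ C.T + M * np.eye(M)
R = rng.normal(size=(M, N))
X = kron_sum_solve(B, C, R)
assert np.allclose(C @ X + X @ B, R)
```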

When K and K_T are not invertible, we suggest using gradient descent to avoid inverting a large matrix of dimension MN x MN. The update step based on the gradient in Eq.(19) is

(25)

where η denotes the learning rate (step size).

V-B2 Update of the Graph Laplacian

Given A, the optimisation problem of Eq.(15) becomes

(26)

which is a constrained quadratic programme w.r.t. L. By taking the fitted values \hat{Y} = K_T A K as the input graph signals, the problem fits into the learning framework of Eq.(1). We use the package CVXPY [64] to solve this problem.
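As an illustration, the sketch below poses one such constrained quadratic programme in CVXPY. The pairwise smoothness statistic Z is an assumed input built from the current fitted signals (e.g. \hat{Y}^T K_T^{\dagger} \hat{Y} under the smoothness term of Section IV, with the weight γ absorbed into Z); the constraint set encodes the valid-Laplacian conditions and the trace normalisation described in Section V-A.

```python
import cvxpy as cp
import numpy as np

def update_L(Z, beta, N):
    """Graph-update step (a sketch of the constrained QP in Eq.(26)) via CVXPY.

    Z    : (N, N) pairwise smoothness statistic built from the current fitted
           signals (assumed input, e.g. Y_hat^T pinv(K_T) Y_hat).
    beta : Frobenius-norm regularisation weight.
    Returns a valid combinatorial Laplacian with tr(L) = N.
    """
    L = cp.Variable((N, N), symmetric=True)
    objective = cp.Minimize(cp.trace(Z @ L) + beta * cp.sum_squares(L))
    constraints = [
        cp.trace(L) == N,                     # normalise the total edge weight
        L @ np.ones(N) == 0,                  # zero row sums
        L - cp.diag(cp.diag(L)) <= 0,         # non-positive off-diagonal entries
    ]
    cp.Problem(objective, constraints).solve()
    return L.value
```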

0:  Observation Y, node-side kernel matrix K, observation-side kernel matrix K_T, hyperparameters μ, γ and β, tolerance level ε.
1:  Initialisation of the coefficient matrix A and the iteration counter t
2:  while the change in the estimates exceeds ε and the maximum number of iterations is not reached do
3:     Update L by solving Eq.(26)
4:     Update A by (i) using the closed-form solution in Eq.(24) if K and K_T are invertible, or (ii) updating A with gradient descent as in Eq.(25) otherwise
5:     t ← t + 1
6:  end while
7:  return L, A
Algorithm 1 Kernel Graph Learning (KGL)

The overall KGL framework is presented in Algorithm 1. The convergence of each update step of L and A is guaranteed by solving the respective convex optimisation problems of Eq.(17) and Eq.(26). It should be noted that the step size η in Eq.(25) needs to be set appropriately for the gradient descent to converge; we suggest a suitably small value based on empirical results. We now prove the following lemma.

Lemma V.1.

The sequence of objective values generated by Algorithm 1 converges monotonically, and the solution obtained by Algorithm 1 is a stationary point of Eq.(15).

Proof.

We follow the convergence results of the alternate convex search in [63] and those of the more general cyclic block coordinate descent algorithm in [62]. By recognising that Eq.(17) and Eq.(26) are quadratic programmes (with Lemma C.1 in Appendix C), the problem of Eq.(15) is a biconvex problem with all terms differentiable and the objective function continuous and bounded from below. Theorem 4.5 in [63] states that the sequence of objective values generated by Algorithm 1 converges monotonically. Theorem 4.1 in [62] states that the sequences of iterates generated by Algorithm 1 are well defined and bounded. Furthermore, according to Theorem 5.1 in [62], every cluster point is a coordinatewise minimum point of the objective, and hence the solution is a stationary point of Eq.(15).

Our empirical results suggest that after only 10 iterations or fewer, the iterates do not change by more than the tolerance level. The computational complexity of KGL in Algorithm 1 is dominated by the step of updating A: the closed-form solution costs on the order of max(M, N)^3 operations, and computing the gradient in Eq.(19) has a comparable per-iteration cost when the closed-form solution is not applicable because K and K_T are not invertible. Updating L amounts to solving a quadratic programme over the edge weights. Overall, for the number of iterations required for Algorithm 1 to converge, the total cost scales accordingly, depending on whether K and K_T are chosen to be invertible. We note that one could readily appeal to large-scale kernel approximation methods to further reduce the computational and storage complexity of the KGL framework, and hence broaden its applicability to larger datasets. There are two main approaches to large-scale kernel approximation, and both can be applied to KGL: the first focuses on kernel matrix approximations using methods such as Nyström sampling [65], while the second deals with the approximation of the kernel function itself, using methods such as random Fourier features [66]. Nonetheless, this paper focuses on the modelling perspective, and we leave the algorithmic improvement as a future direction.

V-C Special Cases of Kernel Graph Learning

Independent observations

It is often assumed that graph signals are i.i.d., and hence that there exists no dependency along the observation side. This is equivalent to setting K_T = I in our framework. We refer to this special case of KGL as Node-side Kernel Graph Learning (KGL-N):

(27)
No node-side information

It may also be the case that no node-side information is available for the problem at hand. In this case, we can simply set K = I in KGL, leading to Observation-side Kernel Graph Learning (KGL-O):

(28)

In the case of one-side KGL, the optimisation again follows the alternating minimisation, i.e. to solve KGL-N or KGL-O, one can simply set K_T = I or K = I in Algorithm 1. It should be noted, however, that the update step of A requires less computation when either kernel matrix is the identity. Indeed, the objective function of KGL-N can be decomposed into a sum over the individual functions (one per observation), such that

(29)

where each term only involves one row of Y and the corresponding row of A. Consequently, the update step can be parallelised across observations.

V-D Learning with Missing Observations

By modifying the least-squares loss in KGL, we propose an extension to jointly learn the underlying graph and function from graph-structured data with missing values. We encode the positions of missing values with a binary mask matrix of the same size as Y, whose entry is 0 if the corresponding observation is missing and 1 otherwise. Now, we only need to minimise the least-squares loss over the observed entries in the functional learning part, which leads to the formulation:

(30)

This formulation also applies to one-side kernel graph learning, i.e. KGL-N or KGL-O, with K_T = I or K = I, respectively.

The optimisation problem in Eq.(30) is a biconvex problem and alternating minimisation can still be applied. The update step of L remains the same as in Eq.(26), but the gradient in the update step of A (Step 4 in Algorithm 1) becomes

(31)

where the mask enters through an entrywise (Hadamard) product with the residual. The detailed derivation of the gradient is provided in Appendix D. We further assume that K_T is invertible, which is a mild assumption since many kernel choices make K and K_T invertible. Also noting that K_T K_T^{-1} = I, the gradient becomes

(32)

One can either derive a closed-form solution or use gradient descent based on Eq.(32).
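As an illustration of this extension, the sketch below performs one gradient step on A under a masked objective assembled from the pieces described in the earlier sections (masked squared loss, RKHS-norm penalty, and the Section IV smoothness term with K_T assumed invertible). It is a sketch of the idea rather than a transcription of Eq.(31)-(32).

```python
import numpy as np

def masked_grad_step(A, Y, Omega, K, K_T, L, mu, gamma, eta):
    """One gradient-descent step on A for the missing-value extension.

    Assumed objective (a reconstruction, not the paper's exact Eq.(31)-(32)):
        ||Omega * (Y - K_T A K)||_F^2 + mu tr(K A^T K_T A)
        + gamma tr((K_T A K)^T pinv(K_T) (K_T A K) L),
    with K_T invertible. Omega is the binary mask (1 = observed, 0 = missing).
    """
    F = K_T @ A @ K                              # current fitted values
    grad = (-2 * K_T @ (Omega * (Y - F)) @ K     # masked data-fit term
            + 2 * mu * K_T @ A @ K               # RKHS-norm (Tikhonov) term
            + 2 * gamma * F @ L @ K)             # graph-smoothness term (uses K_T K_T^{-1} = I)
    return A - eta * grad                        # plain gradient descent with step size eta
```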

VI Synthetic Experiments

VI-A General Settings

Groundtruth Graphs

Random graphs of N nodes are drawn from the Erdős-Rényi (ER), Barabási-Albert (BA) and stochastic block model (SBM) families as groundtruth, denoted by the corresponding ER, BA and SBM graphs, respectively. The parameters of each network model are chosen to yield an edge density of 0.3. The edge weights are randomly drawn from a uniform distribution. The weighted adjacency matrix is symmetrised and normalised such that the sum of edge weights equals the number of nodes, for ease of comparison. The graph Laplacian L is calculated from the resulting weighted adjacency matrix.

Groundtruth Data

We generate mildly noisy data as the sum of a smooth signal matrix and a noise matrix, where the smooth signal matrix is drawn according to Eq.(8b) and every entry of the noise matrix is an i.i.d. sample from a zero-mean Gaussian distribution. To test the proposed model against different levels of noise, we vary the noise level in Section VI-B; for all other synthetic experiments, we add mild noise. We choose the node-side kernel matrix K as the pseudo-inverse of the groundtruth Laplacian, as this is a popular way to generate smooth signals in related work [16, 44]. We consider both dependence and independence along the observation side (a generation sketch is given after the list below):

  • Independent data: K_T = I;

  • Dependent data: K_T is obtained from an RBF kernel evaluated on synthetic observation-side information, which can be interpreted as the time-stamps of a discrete-time Markov chain. The bandwidth parameter is chosen according to the median heuristic [67].
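The following is a minimal sketch of this data-generation process; the Erdős-Rényi generator, the uniform weight range and the noise constant are illustrative stand-ins for the exact experimental settings.

```python
import numpy as np
import networkx as nx

def generate_smooth_signals(n_nodes=20, n_obs=100, noise=0.1, dependent=False, seed=0):
    """Generate two-side dependent smooth graph signals (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # groundtruth graph: Erdos-Renyi topology with uniform random edge weights
    G = nx.erdos_renyi_graph(n_nodes, 0.3, seed=seed)
    W = nx.to_numpy_array(G) * rng.uniform(0.1, 1.0, (n_nodes, n_nodes))
    W = (W + W.T) / 2
    W *= n_nodes / W.sum()                       # normalise the total edge weight
    L_true = np.diag(W.sum(axis=1)) - W
    K = np.linalg.pinv(L_true)                   # node-side kernel = Laplacian pseudo-inverse
    if dependent:
        t = np.arange(n_obs, dtype=float)[:, None]
        d2 = (t - t.T) ** 2
        K_T = np.exp(-d2 / (2 * np.median(d2[d2 > 0])))   # RBF on time-stamps, median heuristic
    else:
        K_T = np.eye(n_obs)
    def sqrtm_psd(S):
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T
    # matrix-normal draw: cov(vec(F)) = K kron K_T, as in Eq.(8b)
    F = sqrtm_psd(K_T) @ rng.normal(size=(n_obs, n_nodes)) @ sqrtm_psd(K)
    return F + noise * rng.normal(size=F.shape), W, L_true
```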

Model Candidates

To have a fair comparison, the models are divided into two groups. The first group contains the baseline models that cannot deal with observation-side dependence:

  • GL (Eq.(14) in [16]): the GSP graph learning model in Eq.(1) with the regulariser proposed therein.

  • GL-2step (Eq.(16) in [17]): a two-step GSP graph learning framework with an identity mapping as the denoising function. From a modelling perspective, Eq.(13) in [20] proposed a similar model with more constraints on the edge weights and a different optimisation algorithm; we treat them as the same kind of technique.

  • KGL-N (proposed model in Eq.(27)): KGL with K_T = I.

The second group is examined with observation-side dependent data:

  • KGL-Agnostic (Eq.(18) in [44]): As discussed in Section II, the joint learning model in [44] considered the observation-side kernel, but did not use the observation-side dependence in learning the graph. We denote their model as KGL-Agnostic in our notation:

    (33)
  • KGL (proposed model in Eq. (15)): the main learning framework.

  • KGL-O (proposed model in Eq.(28)): To have a fair comparison with KGL-Agnostic, we likewise assume the node-side information is not available to the model (i.e. K = I as model input).

For each method, we determine the hyperparameters via a grid search, and report the highest performance achieved by the best set of hyperparameters.

Evaluation Metrics

Average precision score (APS) and normalised sum of squared errors (SSE) are used to evaluate the graph estimates, and out-of-sample mean squared error (MSE) is used to evaluate the estimated entries of the graph-structured data matrix that were not observed (or missing). The APS is defined in a binary classification scenario for graph structure recovery, where the threshold on the learned edge weights above which edges are declared as detected is varied automatically. An APS of 1 indicates that the algorithm can precisely detect the groundtruth edges and non-edges. The SSE is defined between the learned adjacency matrix and the groundtruth adjacency matrix (each suitably normalised, e.g. to unit Frobenius norm). The out-of-sample MSE of the data matrix is defined with a mask matrix (the same as in Eq.(30)) indicating the missing entries: it is the mean squared error between the model estimates and the groundtruth values over those entries. Similarly, we are interested in the training MSE, computed over the observed entries, for analysing overfitting.
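For concreteness, these metrics can be computed as in the sketch below; the APS uses scikit-learn's average_precision_score, and the SSE normalisation shown (unit Frobenius norm for each adjacency matrix) is an assumption about the exact definition.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def graph_metrics(W_hat, W_true):
    """APS and normalised SSE for a learned adjacency matrix (sketch)."""
    iu = np.triu_indices_from(W_true, k=1)               # unique node pairs
    aps = average_precision_score(W_true[iu] > 0, W_hat[iu])
    sse = np.sum((W_hat / np.linalg.norm(W_hat)
                  - W_true / np.linalg.norm(W_true)) ** 2)
    return aps, sse

def out_of_sample_mse(Y_hat, Y_true, Omega):
    """MSE over the entries flagged as missing (Omega == 0)."""
    miss = Omega == 0
    return np.mean((Y_hat[miss] - Y_true[miss]) ** 2)
```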

VI-B Learning a Graph from Noisy Data

In order to evaluate the performance of the proposed models in learning a graph from noisy data, we add noise to the groundtruth data, where every entry of the noise matrix is an i.i.d. zero-mean Gaussian sample. We vary the noise level from 0 to 2 and plot the evaluation metrics against it in Figure 2. Under the same settings of noise level, random graph and model candidate, we repeat the experiment 10 times and report the mean (the solid curves) as well as the 5th and 95th percentiles (the error bars) of the evaluation metrics.

From Figure 2, Figure 8 and Figure 9 (the latter two in Appendix E-A), the proposed models outperform the baseline models in terms of all evaluation metrics. Specifically, for the graphs shown in Figure 2, when the data are independent, the performance of KGL-N drops slowly as the noise level increases, while that of GL and GL-2step drops quickly once the noise level goes above 0.5. It is worth mentioning that the curves of the two baselines almost overlap in terms of APS. This indicates that the identity mapping used as the denoising function in GL-2step does not help much in recovering the graph structure, although it yields a slightly smaller SSE than GL when the noise level is greater than 0.75.

Fig. 2: The performance of recovering groundtruth graphs in terms of APS and SSE from independent data (1st row) and dependent data (2nd row) under different noise levels.

For the dependent data, the proposed model KGL achieves high performance when the noise level is less than 0.75. Even without the node-side information as model input (note that the groundtruth data are generated in a way consistent with [44], which proposed KGL-Agnostic), the proposed KGL-O can still learn a meaningful graph, with slightly worse performance than KGL. By contrast, KGL-Agnostic cannot recover the groundtruth graph to a satisfactory level even with little noise, as its smoothness term does not capture the dependence structure on the observation side.

From Figure 8 and Figure 9, we see that the other graph types are slightly more difficult to recover from data, but an improvement of the proposed models over the baselines can nevertheless be seen when the noise level is low.

VI-C Learning a Graph from Missing Data

Fig. 3: The performance of recovering groundtruth graphs in terms of APS and SSE from independent data (1st row) and dependent data (2nd row) with different rates of missing values in the data matrix.

To examine the performance of learning a graph from incomplete data using the formulation described in Section V-D, we generate the mask matrix indicating missing entries, where each entry of the data matrix is missing independently with probability equal to the missing rate. The preprocessed data, with 0 replacing the missing entries, is a natural choice in practice for the model candidates that cannot directly deal with missing entries, as the mean value of the data entries is 0 by design. We use this preprocessed matrix as the input to the baseline models GL and GL-2step. For KGL-Agnostic in Eq.(33), we also add a mask matrix to the least-squares loss term to enable a fair comparison.

We vary the missing rate from 0 to 0.9 and plot the evaluation metrics, APS and SSE, against it in Figure 3. The plots for the other graph models are in Appendix E-B. Similar to the noisy scenario, we repeat each experiment 10 times under the same settings of missing rate, random graph and model candidate and report the mean (solid curves) as well as the 5th and 95th percentiles (error bars) of the evaluation metrics.

For the graphs shown in Figure 3, the proposed methods KGL and KGL-N recover the groundtruth graphs reasonably well even when 80% of the entries are missing. For the independent data scenario, the performance of the baseline models with the preprocessed data drops steeply as the missing rate increases. Although they can still recover the groundtruth graphs with a high APS and low SSE when the missing rate is less than 20%, the performance is not as good as that of KGL-N. By contrast, for KGL-N, the APS only drops by 0.1 from no missing entries to around 90% missing entries, while the SSE only increases by around 0.05.

For the dependent data scenario, all three model candidates can deal with missing data directly by adding a mask matrix to the least-squares loss term. Consequently, their curves are relatively stable as the missing rate increases from 0 to 80%. However, the accuracy levels at which each of the models stabilises are different. The proposed method KGL, aware of both node-side and observation-side information, achieves the highest APS and lowest SSE. By contrast, KGL-O, without access to the node-side information, can still recover a meaningful graph, but with fewer correct edges and less accurate edge weights when the missing rate is less than 0.6. The performance of KGL-Agnostic is not as good as that of KGL-O because the smoothness term in Eq.(33) is not able to disentangle the influence of the observation-side dependency from the graph structure.

The performance of KGL does not differ much across the different random graph models. Still, there is an improvement in graph recovery compared to the baseline graph learning models, as can be seen in Figure 10 and Figure 11 in Appendix E-B.

VI-D Graph-Structured Matrix Completion

Fig. 4: The MSE of recovering missing entries in the data matrix, for independent and dependent data. The results show the mean of 30 random experiments over 3 different types of graphs, i.e. 10 for each graph type.

In the experiment of learning a graph from an incomplete data matrix in Section VI-C, we are also interested in the performance of matrix completion. In Figure 4, we plot the MSE of the missing entries (i.e. the out-of-sample MSE), averaged over all types of graphs, against the varying missing rate. The proposed methods lead to much smaller errors compared to the baseline models, for both independent and dependent data. It should be noted that GL does not offer a mechanism for inferring missing data and is hence not included in this experiment.

VI-E Learning a Graph of Different Sizes

The graph learning performance of the proposed method varies with the size of the graph and the number of observations. As shown in Figure 5, a hundred observations are sufficient to recover a small graph with high accuracy. When the graph size increases, the number of observations required for KGL to achieve a high APS increases roughly exponentially. On the other hand, as the number of observations increases, the variance of the APS of the learned graphs decreases.

Fig. 5: The performance of learning graphs of different sizes against varying number of observations with the proposed model KGL. The results show the mean of 30 random experiments of 3 different types of graphs, i.e. 10 for each graph type.

VI-F Impact of Regularisation Hyperparameters

Three hyperparameters are involved in KGL and its variants. As introduced in Section V, μ controls the complexity of the functional learning and prevents overfitting; γ controls the relative importance of graph learning compared to functional learning and, at the same time, determines the smoothness of the predicted data over the learned graph; finally, β controls the Frobenius norm of the graph Laplacian which, together with the trace constraint, bears similarity to an elastic net regularisation [68]. The larger the β, the less sparse the graph and the more uniform the edge weights. The accuracy in learning the graph Laplacian and inferring the data matrix is determined by the combination of these three hyperparameters, which is not straightforward to visualise and analyse simultaneously. Fortunately, we may still gain some insights by examining their distinct effects separately.

Firstly, β should be chosen according to the prior belief about the graph sparsity, defined as the number of edges with non-zero weights. As shown in Figure 12 (in Appendix E-C), when β is small, the learned graph contains only a few of the most significant edges: due to the constraint on the sum of edge weights, the total weight is allocated to a few significant edges. When β is large, the learned graph becomes fully connected with nearly equal edge weights. In the synthetic experiment, where we have knowledge of the groundtruth graph, the best accuracy is obtained when the learned sparsity coincides with that of the groundtruth graph.

The value of β, on the other hand, has little effect on the accuracy of predicting the missing entries in the data matrix, as the update step of the coefficients does not involve the term with β. As shown in Figure 13(a)-(d), the out-of-sample MSE is determined by the combination of μ and γ. We first notice that the error saturates when μ is very large, showing that we overly penalise the function complexity in this case: the function becomes overly smooth, the entries of the coefficient matrix all shrink to zero, and the predictions are all zero as well. A similar effect occurs when γ is very large, where the prediction is forced to be overly smooth on the learned graph and over the observation-side dependency; in particular, each predicted graph signal tends towards constant entries as the smoothness term is driven to zero, and in the extreme all entries of the prediction become constant.

The ranges of values of μ and γ for which the out-of-sample MSE is smallest generally coincide with those for which the APS is largest in Figure 13 (in Appendix E-C). However, the out-of-sample MSE is also generally small when μ and γ are both small. This is understandable, as the function can be very complex with little penalisation, but this does not guarantee good performance in recovering the groundtruth graph.

VII Real-World Experiments

VII-A Swiss Temperature Data

In this experiment, we test the proposed models in learning a meteorological graph of 89 weather stations in Switzerland from incomplete temperature data³. The raw data matrix contains 12 rows, representing the temperatures of 12 months averaged over the 30 years from 1981 to 2010, and 89 columns, representing the 89 measuring stations. The raw data are preprocessed such that each row has zero mean.

³The data are obtained from https://www.meteoswiss.admin.ch/home/climate/swiss-climate-in-detail/climate-normals/normal-values-per-measured-parameter.html.

To have a fair comparison, we deliberately omit a portion of the data as input for the model candidates, and treat as groundtruth the graph learned by GL from the complete 12-month data. Specifically, we only use the first three months of temperature records (i.e. the first three rows) to learn a graph. To test the performance in the missing-data scenario, we further generate mask matrices with different rates of missing values, as described in Section VI-C, and apply them to the three-month data. Similar to the synthetic experiments, we choose the hyperparameters of the models such that they yield a graph density of 30% (i.e. keeping the 30% most significant edges) to make the learned graphs comparable.

The altitude of a weather station is useful node-side information for predicting temperature and learning a meteorological graph. Therefore, the node-side kernel matrix K is obtained from an RBF kernel evaluated at the altitudes of each pair of weather stations, for use in KGL and KGL-N. Monthly time-stamps correspond to the known observation-side information; the bandwidth parameter is chosen according to the median heuristic [67]. As described in Section VI-A, K_T is thus obtained from an RBF kernel evaluated at the three time-stamps of the three-month graph signals and is used as input to KGL, KGL-Agnostic and KGL-O. For GL and GL-2step, no side information is used. The hyperparameters are tuned via a grid search and the highest performance achieved by the best set of hyperparameters is reported. For real-world scenarios where a groundtruth graph is not easy to obtain, the hyperparameters can be chosen according to the results in Section VI-F.

Fig. 6: The performance (APS and SSE) of learning a meteorological graph of 89 Swiss weather stations from incomplete temperature data with various missing rates.

We present the results in Figure 6. In terms of both APS and SSE, KGL and KGL-N outperform the other candidates. This indicates that altitude is a reliable node-side covariate with which we can learn a meaningful meteorological graph despite a small number of signals. When there are more missing values, KGL, with the known temporal information, slightly outperforms KGL-N. However, since we only have three months of signals, the temporal information is less predictive. Since the groundtruth graph is learned by GL with 12-month signals, the recovery ability of GL and GL-2step is not far behind when there are no missing values in the three-month data, but it drops sharply with an increasing missing rate. Compared to KGL-O, the poor performance of KGL-Agnostic indicates that the imprecise smoothness term in Eq.(33) is the main reason why we cannot recover an annual meteorological graph with only three-month temperature records, as both of them are agnostic to the node-side information.

VII-B Sushi Review Data

In this experiment we evaluate the performance of our proposed methods by comparing the recovered graphs with the groundtruth, using the Sushi review data collected in [69]. The authors tasked 5000 reviewers with rating 10 out of 100 sushis, chosen at random, with a score from 1 (least preferred) to 5 (most preferred); the ratings from each reviewer are treated as one graph signal in this experiment. For each reviewer, we have 10 descriptive features covering demographic information, such as age, gender and the region the reviewer currently lives in. We also have 7 attributes describing each sushi, including its oiliness in taste, normalised price and its grouping (for example, red-meat fish sushi, white-meat fish sushi or shrimp sushi). We treat the grouping attribute as the underlying groundtruth label for each sushi and do not use it in the KGL algorithm.

We will consider 32 sushis from 5 sushi groups, namely red-meat (7 sushis), clam (6 sushis), blue-skinned fish (8 sushis), vegetable (6 sushis) and roe sushi (5 sushis). These are treated as the groundtruth labels. Our goal will be to recover a graph of sushis which contains clusters corresponding to these group labels (while omitting the group attribute from the node-side information, i.e. we retain only 6 remaining attributes).

We pick an increasing number of reviewers at random for our experiment, in order to demonstrate how the algorithms perform under different numbers of signals. After preprocessing, we arrive at a data matrix with 32 columns, each representing a type of sushi, and rows representing each reviewer's ratings of the sushis. This is a sparse matrix, as each reviewer only rated a small subset of the sushis. We run KGL, KGL-N, KGL-O, GL, GL-2step and KGL-Agnostic to obtain a graph of sushis. To evaluate the results quantitatively, we compute the normalised mutual information (NMI) between the cluster assignments obtained by applying spectral clustering [70] to the recovered graphs and the underlying sushi grouping. NMI measures the agreement between two grouping assignments, with 0 indicating no mutual information and 1 indicating perfect agreement. To emphasise that node-side attributes alone are not sufficient to recover the groundtruth labels, we also applied clustering to the RBF graph obtained from the remaining six sushi attributes, which resulted in an NMI score of 0.34.
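The evaluation can be sketched as follows, using scikit-learn for the spectral clustering and NMI computation; the function and parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(W_hat, labels_true, n_clusters=5, seed=0):
    """Cluster a learned (non-negative, symmetric) sushi graph and compare the
    cluster assignments with the withheld group labels via NMI."""
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=seed)
    labels_pred = sc.fit_predict(np.asarray(W_hat))
    return normalized_mutual_info_score(labels_true, labels_pred)
```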

Fig. 7: Agreement between cluster assignments on learned sushi graphs and the withheld sushi group attributes. Graphs are learned from incomplete sushi review data.

Figure 7 illustrates our results. KGL was the best performer, followed by KGL-N, and both significantly outperformed KGL-Agnostic, KGL-O, GL-2step and GL. Moreover, both KGL and KGL-N outperformed the case where we solely use the RBF graph. The results demonstrate the merit of our proposed methods in incorporating side information for graph recovery, in particular the observation-side information (i.e. the reviewers' information), which captures the dependency between the observed signals.

VIII Conclusion

In this paper, we have revisited smooth graph signals from a functional viewpoint and proposed a kernel-based graph learning framework that can integrate node-side and observation-side covariates. Specifically, we have designed a novel notion of smoothness of graph signals over the Kronecker product of two graph Laplacian matrices and combined it with a Kronecker product kernel regression of graph signals in order to capture the two-side dependency. We have shown the effectiveness and efficiency of the proposed method via extensive synthetic and real-world experiments, demonstrating its usefulness in learning a meaningful topology from noisy, incomplete and dependent graph signals. Although we have proposed a fast implementation exploiting the Kronecker structure of the kernel matrices, the computational complexity remains cubic in the maximum of the number of nodes and the number of signals. A natural future direction is therefore to further reduce the computational complexity with state-of-the-art methods for large-scale kernel-based learning, such as random Fourier features. Another interesting direction is to develop a generative graph learning model based upon the framework presented here, using connections between Gaussian processes and kernel methods.

References

  • [1] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic parsing on Freebase from question-answer pairs,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1533–1544.
  • [2] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Proc. - Annu. IEEE Symp. Found. Comput. Sci. FOCS, vol. 51, no. 1, pp. 459–468, 2006.
  • [3] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” VISAPP 2009 - Proc. 4th Int. Conf. Comput. Vis. Theory Appl., vol. 1, pp. 331–340, 2009.
  • [4] J. L. Bentley, “Multidimensional Binary Search Trees Used for Associative Searching,” Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.
  • [5] W. Dong, M. Charikar, and K. Li, “Efficient K-nearest neighbor graph construction for generic similarity measures,” Proc. 20th Int. Conf. World Wide Web, WWW 2011, pp. 577–586, 2011.
  • [6] S. L. Lauritzen, Graphical models.   Clarendon Press, 1996, vol. 17.
  • [7] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
  • [8] A. Ortega, P. Frossard, J. Kovacevic, J. M. Moura, and P. Vandergheynst, “Graph Signal Processing: Overview, Challenges, and Applications,” Proc. IEEE, vol. 106, no. 5, pp. 808–828, 2018.
  • [9] A. Sandryhaila and J. M. Moura, “Discrete signal processing on graphs,” IEEE transactions on signal processing, vol. 61, no. 7, pp. 1644–1656, 2013.
  • [10] G. Mateos, S. Segarra, A. G. Marques, and A. Ribeiro, “Connecting the dots: Identifying network structure via graph signal processing,” IEEE Signal Processing Magazine, vol. 36, no. 3, pp. 16–43, 2019.
  • [11] X. Dong, D. Thanou, M. Rabbat, and P. Frossard, “Learning graphs from data: A signal representation perspective,” IEEE Signal Processing Magazine, vol. 36, no. 3, pp. 44–63, 2019.
  • [12] D. Thanou, X. Dong, D. Kressner, and P. Frossard, “Learning heat diffusion graphs,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 484–499, 2017.
  • [13] B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, and M. G. Rabbat, “Characterization and Inference of Graph Diffusion Processes from Observations of Stationary Signals,” IEEE Trans. Signal Inf. Process. over Networks, vol. 4, no. 3, pp. 481–496, 2018.
  • [14] R. Shafipour, S. Segarra, A. G. Marques, and G. Mateos, “Identifying the Topology of Undirected Networks from Diffused Non-stationary Graph Signals,” no. 1, pp. 1–14, 2018. [Online]. Available: http://arxiv.org/abs/1801.03862
  • [15] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro, “Network topology inference from spectral templates,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 467–483, 2017.
  • [16] V. Kalofolias, “How to learn a graph from smooth signals,” in Artificial Intelligence and Statistics, 2016, pp. 920–929.
  • [17] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Learning laplacian matrix in smooth graph signal representations,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6160–6173, 2016.
  • [18] H. E. Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017.
  • [19] S. P. Chepuri, S. Liu, G. Leus, and A. O. Hero, “Learning sparse graphs under smoothness prior,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 6508–6512.
  • [20] P. Berger, G. Hannak, and G. Matz, “Efficient graph learning from noisy and incomplete data,” IEEE Transactions on Signal and Information Processing over Networks, vol. 6, pp. 105–119, 2020.
  • [21] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neural relational inference f