 # Supervised Linear Regression for Graph Learning from Graph Signals

We propose a supervised learning approach for predicting an underlying graph from a set of graph signals. Our approach is based on linear regression. In the linear regression model, we predict the edge-weights of a graph as the output, given a set of signal values on the nodes of the graph as the input. We solve for the optimal regression coefficients using a convex optimization problem with a graph-Laplacian based regularization. The regularization helps to promote a specific graph-spectral profile of the graph signals. Simulation experiments demonstrate that our approach predicts well even in the presence of outliers in the input data.


## I Introduction

Graph learning in the context of graph signal processing refers to the problem of learning associations between the different nodes/agents of a graph or network: a network structure is inferred from the given signal values at the different nodes. Graph learning is part of many analysis and processing tasks, such as clustering, community detection, prediction of signal values, or prediction of entire graph signals. Various models have been proposed to infer a graph from a set of signals. Most notable are works on graph inference from smooth signals, based on the assumption that signals vary slowly over the graph structure. Pasdeloup et al. and Segarra et al. assume that signals are given as the result of an arbitrary graph filtering process while learning the graph. Similarly, Mei et al. propose a polynomial autoregressive model for graph signals and a method to infer both the graph and the coefficients of the polynomial.

We note that the aforementioned works take a one-shot approach by learning the graph that best describes a given set of graph signals under suitable constraints. They do not explicitly use a training dataset with labeled graphs and graph signals, and hence, may be seen as unsupervised learning approaches for graph inference.

In this paper, we propose a supervised learning approach for predicting graphs from graph signals. A motivating example of the supervised graph learning approach arises in a social network scenario. In social networks, nodes represent different individuals. Let us assume that we have a training dataset comprising graph signals and underlying graphs. The graph signals may comprise different features of the individuals, such as age, height, salary, food tastes, and consumer habits. An underlying graph could be one formed by a rule based on who follows whom, or by the friendship lists of individuals. Now, in the case of test data, we may have privacy, security, or legal reasons for not revealing the true underlying graph. The task is then to estimate the underlying graph from the observed graph signals for the test case.

To the best of the authors' knowledge, there exists no prior work exploring a supervised learning approach for graph learning. A supervised learning approach incorporates prior knowledge through training. In our approach, we model the edge-weights of the graph adjacency matrix as the predicted output of a linear regression model whose input consists of a set of graph signal observations. We compute the optimal regression coefficients from training data by solving an optimization problem with a regularization based on the graph-spectral profile of the graph signals. In order to make the optimization problem convex, we use graph-spectral profiles in the form of a second-order polynomial of the graph-Laplacian matrix. We then discuss how, for a suitably constructed input feature, the regression coefficients represent a weighting of the different graph signals in the input for the prediction task. Simulation experiments reveal that our approach gives good performance for graph learning under difficult conditions, for example, when the training dataset is limited and noisy and the test input is also noisy. A block scheme summarizing our approach is shown in Figure 1.

## II Linear Regression for Graph Learning

We first review the relevant basics from graph signal processing and thereafter propose linear regression for graph prediction.

### II-A Graph signal processing preliminaries

A graph signal refers to a vector whose components denote the values of a signal over the different nodes of an associated graph. The relation between different nodes is quantified using weighted edges, and the graph is described using the adjacency matrix $\mathbf{A}$, whose $(i,j)$th entry $a_{i,j}$ denotes the edge-weight between the $i$th and $j$th nodes. In this work, we consider only undirected graphs, which means $\mathbf{A}=\mathbf{A}^\top$. The smoothness of a graph signal $\mathbf{x}$ over a graph with $N$ nodes is typically defined using the quantity $\mathbf{x}^\top\mathbf{L}\mathbf{x}$, where $\mathbf{L}=\mathbf{D}-\mathbf{A}$ is the graph-Laplacian matrix [8, 9], and $\mathbf{D}$ is the diagonal degree matrix with $\mathbf{D}=\mathrm{diag}(\mathbf{A}\mathbf{1}_N)$, $\mathbf{1}_N$ being the $N$-dimensional vector of all ones. A small value of $\mathbf{x}^\top\mathbf{L}\mathbf{x}$ implies that the values across connected nodes are similar, leading to the notion of a smooth graph signal. A graph signal $\mathbf{x}$ is also equivalently described in terms of its graph-Fourier transform, which is defined as

$$\hat{\mathbf{x}} \triangleq \mathbf{V}^\top\mathbf{x},$$

where $\mathbf{V}$ denotes the eigenvector matrix of $\mathbf{L}=\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top$, and $\boldsymbol{\Lambda}$ is the eigenvalue matrix with eigenvalues arranged in ascending order $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_N$. By construction, $\lambda_1 = 0$, and the eigenvectors belonging to the smaller eigenvalues vary smoothly over the graph and represent low frequencies, while those of larger eigenvalues vary more rapidly, denoting the high frequencies.

In order to impose that $\mathbf{x}$ follows a particular graph-spectral profile (in terms of the distribution of its graph-Fourier spectral coefficients), the regularization $\mathbf{x}^\top h(\mathbf{L})\mathbf{x}$ is often employed, where $h(\cdot)$ is a polynomial. This is because the regularization penalizes the different components of $\hat{\mathbf{x}}$ as

$$\mathbf{x}^\top h(\mathbf{L})\mathbf{x} = \mathbf{x}^\top\mathbf{V}h(\boldsymbol{\Lambda})\mathbf{V}^\top\mathbf{x} = \hat{\mathbf{x}}^\top h(\boldsymbol{\Lambda})\hat{\mathbf{x}} = \sum_{i=1}^{N} h(\lambda_i)\,\hat{x}(i)^2.$$

In the case of smooth graph signals, $h(\mathbf{L})=\mathbf{L}$ is usually employed since $\mathbf{x}^\top\mathbf{L}\mathbf{x} = \sum_{i=1}^{N}\lambda_i\,\hat{x}(i)^2$, which penalizes the high-frequency components of $\hat{\mathbf{x}}$ more than the low-frequency ones. Similarly, setting $h(\mathbf{L})=\mathbf{L}^\dagger$, where $\mathbf{L}^\dagger$ is the pseudo-inverse of $\mathbf{L}$, leads to $\mathbf{x}^\top\mathbf{L}^\dagger\mathbf{x} = \sum_{i:\,\lambda_i>0}\lambda_i^{-1}\,\hat{x}(i)^2$, which promotes $\mathbf{x}$ to have high-frequency behaviour. We refer the reader to [9, 10] and the references therein for a more comprehensive view of the graph signal processing framework.
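These preliminaries can be made concrete with a small numerical sketch (assuming NumPy; the 4-node path graph and the signal values are illustrative choices, not from the paper), which builds the Laplacian, computes the graph-Fourier transform, and verifies that $\mathbf{x}^\top\mathbf{L}\mathbf{x} = \sum_i \lambda_i\hat{x}(i)^2$:

```python
import numpy as np

# Adjacency matrix of a small undirected 4-node path graph (illustrative)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # diagonal degree matrix
L = D - A                    # combinatorial graph Laplacian

# Eigendecomposition: columns of V form the graph-Fourier basis;
# eigh returns eigenvalues in ascending order for the symmetric L
lam, V = np.linalg.eigh(L)

x = np.array([1.0, 1.1, 0.9, 1.0])  # a smooth graph signal
x_hat = V.T @ x                      # graph-Fourier transform

# Smoothness: x^T L x equals sum_i lambda_i * x_hat(i)^2
s_direct = x @ L @ x
s_spectral = np.sum(lam * x_hat**2)
```

For this path graph, `s_direct` is simply the sum of squared differences across edges, so a slowly varying signal yields a small value.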

### II-B Linear regression model for graph prediction

Linear and kernel regression form the workhorse of a gamut of learning applications, from support vector machines to prediction and reconstruction of graph signals [13, 14, 15, 16, 17, 18]. In this section, we propose a graph prediction approach using linear regression. We note that in this paper we use the terms graph prediction and graph learning interchangeably.

Let us assume that we have a training set of one or more graphs indexed by $g$, $1 \le g \le G$. Let the $g$th graph have $N_g$ nodes. We further assume that we have $M$ graph signals for each of the graphs as input. Let $\mathbf{A}^{(g)}$ denote the weighted adjacency matrix of the $g$th graph. Then the input-output pairs are given by $\{(\mathbf{X}^{(g)}, \mathbf{A}^{(g)})\}_{g=1}^{G}$, where $\mathbf{X}^{(g)} \in \mathbb{R}^{N_g \times M}$ denotes the matrix with the $M$ graph signals as columns. We consider the following model for predicting the weight $a^{(g)}_{i,j}$ of the edge between the $i$th and $j$th nodes:

$$a^{(g)}_{i,j} = \mathbf{w}^\top\boldsymbol{\phi}\big(\mathbf{x}^{(g)}(i), \mathbf{x}^{(g)}(j)\big) + \text{model noise}. \qquad (1)$$

Here $\mathbf{w} \in \mathbb{R}^M$ is the regression coefficient vector, and $\boldsymbol{\phi}(\cdot,\cdot)$ is an $M$-dimensional feature vector, where $\mathbf{x}^{(g)}(i)$ is the $i$th row vector of $\mathbf{X}^{(g)}$, as follows:

$$\mathbf{x}^{(g)}(i) = \big[x^{(g)}_1(i), \cdots, x^{(g)}_m(i), \cdots, x^{(g)}_M(i)\big]^\top \in \mathbb{R}^M.$$

Thus, the estimate of $a^{(g)}_{i,j}$ is given by

$$\hat{a}^{(g)}_{i,j} = \mathbf{w}^\top\boldsymbol{\phi}\big(\mathbf{x}^{(g)}(i), \mathbf{x}^{(g)}(j)\big). \qquad (2)$$

The input feature vector $\boldsymbol{\phi}$ is assumed to be known. In the general case, it could be an arbitrary function of the input signal. Intuitively, for our problem it is desirable that the values of $\boldsymbol{\phi}$ reflect the similarity of the signal values between the nodes $i$ and $j$. The smaller the differences $(x_m(i)-x_m(j))^2$, the larger $\boldsymbol{\phi}$ must be in order to ensure a strong edge between nodes $i$ and $j$. Similarly, dissimilar values across the nodes, with large $(x_m(i)-x_m(j))^2$, should result in a $\boldsymbol{\phi}$ with small values. Though multiple such $\boldsymbol{\phi}$ could be constructed, we use a simple choice with the $m$th component of $\boldsymbol{\phi}$ defined by

$$\boldsymbol{\phi}\big(\mathbf{x}(i), \mathbf{x}(j)\big)(m) = \frac{\sigma}{\max\big((x_m(i)-x_m(j))^2,\; \sigma\big)}, \quad 1 \le m \le M.$$

Here $\sigma$ is a parameter introduced to avoid $\boldsymbol{\phi}$ being unbounded when the signal values at nodes $i$ and $j$ are very similar. Thus, we observe that the $m$th component of $\boldsymbol{\phi}$ reflects the similarity of the values of the $m$th graph signal at the $i$th and $j$th nodes. Correspondingly, the components of $\mathbf{w}$ represent the relative importance of the graph signals in predicting the graph. In order to ensure that the graphs have no self-loops, that is, $\hat{a}^{(g)}_{i,i}=0$, we make the additional definition that $\boldsymbol{\phi}(\mathbf{x}(i),\mathbf{x}(j)) = \mathbf{0}$ when $i = j$.
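The feature construction above can be sketched as follows (a minimal NumPy sketch; the function name `phi` and the default value of `sigma` are illustrative assumptions, not from the paper):

```python
import numpy as np

def phi(X, i, j, sigma=1e-2):
    """Feature vector phi(x(i), x(j)) for the linear regression model.

    X : (N, M) matrix whose columns are M graph signals; row X[i] collects
        the values of all M signals at node i.
    Component m is sigma / max((x_m(i) - x_m(j))^2, sigma): similar signal
    values give components close to 1, dissimilar values give components
    close to 0, and the clamp keeps every component bounded by 1.
    The i == j case returns the zero vector so predicted graphs have no
    self-loops, as defined in the text.
    """
    if i == j:
        return np.zeros(X.shape[1])
    d2 = (X[i] - X[j]) ** 2
    return sigma / np.maximum(d2, sigma)
```

With this choice, every component of the feature lies in $(0, 1]$, so the learnt weights $\mathbf{w}$ directly scale the per-signal similarities.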

Then, by collecting all the edge-weights predicted by the regression model (2) for the $g$th graph into a matrix, we have the adjacency matrix estimate for the $g$th graph given by

$$\hat{\mathbf{A}}^{(g)} = \boldsymbol{\Phi}^{(g)}\mathbf{W}_g, \;\forall g, \quad\text{where} \qquad (3)$$

$$\boldsymbol{\Phi}^{(g)} = \begin{bmatrix} \boldsymbol{\phi}\big(\mathbf{x}^{(g)}(1), \mathbf{x}^{(g)}(1)\big)^\top & \cdots & \boldsymbol{\phi}\big(\mathbf{x}^{(g)}(1), \mathbf{x}^{(g)}(N_g)\big)^\top \\ \vdots & & \vdots \\ \boldsymbol{\phi}\big(\mathbf{x}^{(g)}(N_g), \mathbf{x}^{(g)}(1)\big)^\top & \cdots & \boldsymbol{\phi}\big(\mathbf{x}^{(g)}(N_g), \mathbf{x}^{(g)}(N_g)\big)^\top \end{bmatrix}, \qquad \mathbf{W}_g = \begin{bmatrix} \mathbf{w} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{w} & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{w} \end{bmatrix} = \mathbf{I}_{N_g} \otimes \mathbf{w};$$

$\otimes$ denotes the Kronecker product operation, and $\mathbf{I}_{N_g}$ is the identity matrix of size $N_g$. Then, the corresponding graph-Laplacian estimate is given by

$$\hat{\mathbf{L}}^{(g)} = \hat{\mathbf{D}}^{(g)} - \hat{\mathbf{A}}^{(g)} = \mathrm{diag}\big(\boldsymbol{\Phi}^{(g)}\mathbf{W}_g\mathbf{1}_{N_g}\big) - \boldsymbol{\Phi}^{(g)}\mathbf{W}_g \overset{(a)}{=} \sum_{n=1}^{N_g} \mathbf{e}_n^\top\big(\boldsymbol{\Phi}^{(g)}\mathbf{W}_g\mathbf{1}_{N_g}\big)\,\mathbf{e}_n\mathbf{e}_n^\top - \boldsymbol{\Phi}^{(g)}\mathbf{W}_g, \;\forall g, \qquad (4)$$

where $\mathbf{1}_{N_g}$ is the all-ones column vector of length $N_g$ and $\mathbf{e}_n$ is the column vector with all zeros except a one at the $n$th component. The equality $(a)$ follows from the matrix identity $\mathrm{diag}(\mathbf{b}) = \sum_{n} (\mathbf{e}_n^\top\mathbf{b})\,\mathbf{e}_n\mathbf{e}_n^\top$, where $\mathbf{b} = \boldsymbol{\Phi}^{(g)}\mathbf{W}_g\mathbf{1}_{N_g}$.
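The structure of Eqs. (3)-(4) can be checked numerically with a small sketch (assuming NumPy; `build_Phi`, the toy dimensions, and the value of `sigma` are hypothetical choices for illustration):

```python
import numpy as np

def build_Phi(X, sigma=1e-2):
    """Stack feature row-vectors into the N x (N*M) matrix Phi of Eq. (3).

    Block (i, j) of Phi holds phi(x(i), x(j))^T, so the (i, j) entry of
    Phi @ (I_N kron w) equals w^T phi(x(i), x(j)); the diagonal blocks
    are left zero to forbid self-loops.
    """
    N, M = X.shape
    Phi = np.zeros((N, N * M))
    for i in range(N):
        for j in range(N):
            if i != j:
                d2 = (X[i] - X[j]) ** 2
                Phi[i, j * M:(j + 1) * M] = sigma / np.maximum(d2, sigma)
    return Phi

N, M = 4, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((N, M))      # M graph signals as columns
w = rng.standard_normal(M)           # regression coefficients

Phi = build_Phi(X)
W = np.kron(np.eye(N), w.reshape(M, 1))     # W = I_N kron w, (N*M, N)
A_hat = Phi @ W                              # predicted adjacency, Eq. (3)
L_hat = np.diag(A_hat.sum(axis=1)) - A_hat   # Laplacian estimate, Eq. (4)
```

Each row of `L_hat` sums to zero, as expected of a graph Laplacian.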

### II-C Linear regression for graph prediction

Given Eqs. (1), (3), and (4), our goal is to compute the optimal regression coefficients such that the following cost is minimized:

$$J(\mathbf{w}) = \sum_{g=1}^{G} \big\|\mathbf{A}^{(g)} - \hat{\mathbf{A}}^{(g)}\big\|_F^2 + \alpha\sum_{g=1}^{G}\mathrm{tr}\big(\mathbf{X}^{(g)\top} h\big(\hat{\mathbf{L}}^{(g)}\big)\mathbf{X}^{(g)}\big) + \beta\sum_{g=1}^{G}\mathrm{tr}\big(\mathbf{W}_g^\top\mathbf{W}_g\big), \qquad (5)$$

where the first regularization term imposes on the learnt graphs the desired graph-spectral profile (as discussed in Section II-A). The second regularization term ensures that $\mathbf{w}$ remains bounded. In imposing the regularization, we have implicitly assumed that the graph signals follow a particular graph-spectral profile over the associated graph. This assumption is reasonable in cases such as social networks, where the different communities in the graph might still have similar dynamics or distributions of features across the nodes. Now, if we make the further simplifying assumption that all training graphs have the same size $N$, $J(\mathbf{w})$ is expressible as follows:

$$J(\mathbf{w}) = \sum_{g=1}^{G}\big\|\mathbf{A}^{(g)} - \boldsymbol{\Phi}^{(g)}\mathbf{W}\big\|_F^2 + \alpha\sum_{g=1}^{G}\mathrm{tr}\big(\mathbf{X}^{(g)\top} h\big(\hat{\mathbf{L}}^{(g)}\big)\mathbf{X}^{(g)}\big) + \beta\,\mathrm{tr}\big(\mathbf{W}^\top\mathbf{W}\big), \qquad (6)$$

where $\mathbf{W} = \mathbf{I}_N \otimes \mathbf{w}$. We note that (6) is not convex in $\mathbf{w}$ in general. Convex optimization problems have a global minimum and often result in tractable closed-form solutions. This makes it desirable that $J(\mathbf{w})$ in Eq. (6) be convex, which directly translates to the requirement that $h(\cdot)$ be a second-order polynomial of the form $h(\mathbf{L}) = h_0\mathbf{I} + h_1\mathbf{L} + h_2\mathbf{L}^2$. A second-order $h(\cdot)$ is nevertheless fairly generic and can represent various kinds of graph signal behaviour, such as low-pass and high-pass [9, 19]. As $J(\mathbf{w})$ is now convex, the unique globally optimal value of $\mathbf{w}$ is obtained by setting the derivative of $J(\mathbf{w})$ with respect to $\mathbf{w}$ equal to zero. This leads to the following proposition:

###### Proposition 1.

The optimal regression coefficients that minimize the cost in Eq. (6) satisfy a system of linear equations, involving the matrix operation that returns the submatrix formed by only the columns indexed by a given set, with the remaining quantities as defined in Section II-B.

###### Proof.

The proof follows from using matrix calculus to take the gradient of $J(\mathbf{w})$ with respect to $\mathbf{W}$, and uses the chain rule and other standard properties of the Kronecker product and vectorization. We have

$$J(\mathbf{w}) = \sum_{g=1}^{G}\big\|\mathbf{A}^{(g)} - \boldsymbol{\Phi}^{(g)}\mathbf{W}\big\|_F^2 + \alpha\sum_{g=1}^{G}\mathrm{tr}\big(\mathbf{X}^{(g)\top} h_g\big(\hat{\mathbf{L}}^{(g)}\big)\mathbf{X}^{(g)}\big) + \beta N\,\mathrm{tr}\big(\mathbf{w}^\top\mathbf{w}\big).$$

We shall hereafter drop some of the superscripts and subscripts to keep the notation simple. Then, from (3) and (4) we have that

 J(w) =G∑g=1∥A(g)−Φ(g)W∥2FJ1(W) +αG∑g=1tr(X(g)⊤hg(N∑n=1e⊤n(Φ(g)W)(1Nen)e⊤n−Φ(g)W)X(g))J2(W) +βtr(W⊤W)J3(W) =J1(W)+αJ2(W)+βJ3(W) (7)
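The decomposition in (7) can also be evaluated directly, without any trace expansion. The sketch below (assuming NumPy, graphs of equal size $N$, and the clamped similarity feature with a hypothetical `sigma`; `cost_J` is an illustrative helper, not the authors' code) computes $J = J_1 + \alpha J_2 + \beta J_3$ with $h(\mathbf{L}) = h_0\mathbf{I} + h_1\mathbf{L} + h_2\mathbf{L}^2$:

```python
import numpy as np

def cost_J(w, Xs, As, h, alpha, beta, sigma=1e-2):
    """Evaluate the cost J = J1 + alpha*J2 + beta*J3 of Eq. (7) directly.

    Xs, As : lists of (N, M) signal matrices and (N, N) adjacency matrices.
    h = (h0, h1, h2) defines h(L) = h0*I + h1*L + h2*L^2.
    """
    h0, h1, h2 = h
    N = Xs[0].shape[0]
    J = 0.0
    for X, A in zip(Xs, As):
        # Predicted adjacency: entry (i, j) is w^T phi(x(i), x(j)),
        # with the clamped inverse-distance feature and zero diagonal.
        A_hat = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                if i != j:
                    d2 = (X[i] - X[j]) ** 2
                    A_hat[i, j] = w @ (sigma / np.maximum(d2, sigma))
        L_hat = np.diag(A_hat.sum(axis=1)) - A_hat
        hL = h0 * np.eye(N) + h1 * L_hat + h2 * (L_hat @ L_hat)
        J += np.linalg.norm(A - A_hat, "fro") ** 2   # J1 term
        J += alpha * np.trace(X.T @ hL @ X)          # J2 term
    J += beta * N * float(w @ w)                     # J3 = beta*tr(W^T W)
    return J
```

Such a direct evaluation is useful as a sanity check against the expanded forms derived next.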

### II-D Simplifying the cost function

We now analyze these terms separately:

$$\begin{aligned} J_1(\mathbf{W}) &= \sum_{g}\big\|\mathbf{A}^{(g)} - \boldsymbol{\Phi}^{(g)}\mathbf{W}\big\|_F^2 = \sum_{g}\mathrm{tr}\big(\big[\mathbf{A}^{(g)} - \boldsymbol{\Phi}^{(g)}\mathbf{W}\big]^\top\big[\mathbf{A}^{(g)} - \boldsymbol{\Phi}^{(g)}\mathbf{W}\big]\big) \\ &= \sum_{g}\mathrm{tr}\big(\mathbf{A}^{(g)\top}\mathbf{A}^{(g)}\big) - 2\,\mathrm{tr}\Big(\sum_{g}\mathbf{A}^{(g)\top}\boldsymbol{\Phi}^{(g)}\mathbf{W}\Big) + \mathrm{tr}\Big(\mathbf{W}^\top\sum_{g}\big[\boldsymbol{\Phi}^{(g)\top}\boldsymbol{\Phi}^{(g)}\big]\mathbf{W}\Big). \end{aligned}$$
For $J_2$, substituting $h_g\big(\hat{\mathbf{L}}^{(g)}\big) = h^{(g)}_0\mathbf{I} + h^{(g)}_1\hat{\mathbf{L}}^{(g)} + h^{(g)}_2\big(\hat{\mathbf{L}}^{(g)}\big)^2$, expanding $\hat{\mathbf{L}}^{(g)}$ via (4), and repeatedly applying the cyclic property of the trace, we obtain

$$\begin{aligned} J_2(\mathbf{W}) &= \sum_{g=1}^{G} h^{(g)}_0\,\mathrm{tr}\big(\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) \\ &\quad + \sum_{g=1}^{G} h^{(g)}_1\sum_{n=1}^{N}\mathbf{e}_n^\top\big(\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{1}_N\big)\,\mathrm{tr}\big(\mathbf{e}_n\mathbf{e}_n^\top\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) - \sum_{g=1}^{G} h^{(g)}_1\,\mathrm{tr}\big(\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) \\ &\quad + \sum_{g=1}^{G} h^{(g)}_2\sum_{n_1=1}^{N}\sum_{n_2=1}^{N}\big(\mathbf{e}_{n_1}^\top\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{1}_N\big)\big(\mathbf{e}_{n_2}^\top\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{1}_N\big)\,\mathrm{tr}\big(\mathbf{e}_{n_1}\mathbf{e}_{n_1}^\top\mathbf{e}_{n_2}\mathbf{e}_{n_2}^\top\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) \\ &\quad - 2\sum_{g=1}^{G} h^{(g)}_2\sum_{n=1}^{N}\mathbf{e}_n^\top\big(\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{1}_N\big)\,\mathrm{tr}\big(\mathbf{e}_n\mathbf{e}_n^\top\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) \\ &\quad + \sum_{g=1}^{G} h^{(g)}_2\,\mathrm{tr}\big(\mathbf{W}^\top\boldsymbol{\Phi}^{(g)\top}\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big). \qquad (10) \end{aligned}$$

### II-E Taking derivatives of the cost function parts with respect to W

In order to keep the mathematics self-contained, we list here some properties of matrix calculus which we shall be using later:

$$\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = \big(\mathbf{B}^\top \otimes \mathbf{A}\big)\,\mathrm{vec}(\mathbf{X}), \qquad \frac{\partial\,\mathrm{tr}\big(\mathbf{X}^\top\mathbf{A}\mathbf{X}\mathbf{B}\big)}{\partial\mathbf{X}} = \mathbf{A}\mathbf{X}\mathbf{B} + \mathbf{A}^\top\mathbf{X}\mathbf{B}^\top, \qquad \mathrm{tr}\big(\mathbf{A}^\top\mathbf{B}\big) = \big(\mathrm{vec}\,\mathbf{A}\big)^\top\mathrm{vec}\,\mathbf{B}.$$
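These identities are easy to verify numerically (a sketch assuming NumPy; note the identities use column-major vectorization, hence `order="F"`, and the trace-derivative is checked by central finite differences on random matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order="F")  # column-major vec, as in the identity

# vec(AXB) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)

# tr(A^T C) = vec(A)^T vec(C)
C = rng.standard_normal((3, 4))

# d tr(X^T A X B)/dX = A X B + A^T X B^T, checked by finite differences
n, m = 4, 3
A2 = rng.standard_normal((n, n))
B2 = rng.standard_normal((m, m))
X2 = rng.standard_normal((n, m))
G = A2 @ X2 @ B2 + A2.T @ X2 @ B2.T   # claimed gradient
eps = 1e-6
G_fd = np.zeros_like(X2)
for i in range(n):
    for j in range(m):
        E = np.zeros((n, m))
        E[i, j] = eps
        G_fd[i, j] = (np.trace((X2 + E).T @ A2 @ (X2 + E) @ B2)
                      - np.trace((X2 - E).T @ A2 @ (X2 - E) @ B2)) / (2 * eps)
```

Such checks are a cheap safeguard against sign and transpose errors in derivations of this kind.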

Then, from the expansion of $J_1(\mathbf{W})$ above, we have that

$$\frac{\partial J_1(\mathbf{W})}{\partial\mathbf{W}} = -2\sum_{g}\boldsymbol{\Phi}^{(g)\top}\mathbf{A}^{(g)} + 2\sum_{g}\big[\boldsymbol{\Phi}^{(g)\top}\boldsymbol{\Phi}^{(g)}\big]\mathbf{W}. \qquad (11)$$
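The gradient (11) can be validated by a finite-difference check, treating $\mathbf{W}$ as a free matrix (a sketch assuming NumPy; the toy dimensions and random data are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
G_num, N, M = 2, 3, 2
Phis = [rng.standard_normal((N, N * M)) for _ in range(G_num)]
As = [rng.standard_normal((N, N)) for _ in range(G_num)]
W = rng.standard_normal((N * M, N))

def J1(W_):
    """J1(W) = sum_g || A^(g) - Phi^(g) W ||_F^2."""
    return sum(np.linalg.norm(A - P @ W_, "fro") ** 2
               for P, A in zip(Phis, As))

# Gradient per (11): -2 sum_g Phi^T A + 2 sum_g (Phi^T Phi) W
grad = sum(-2 * P.T @ A + 2 * (P.T @ P) @ W for P, A in zip(Phis, As))

# Central finite differences, entry by entry
eps = 1e-6
fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        fd[i, j] = (J1(W + E) - J1(W - E)) / (2 * eps)
```

The analytic and numerical gradients agree to within finite-difference precision.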

Similarly, from (10) we have that

$$\begin{aligned} \frac{\partial J_2(\mathbf{W})}{\partial\mathbf{W}} &= \sum_{g=1}^{G} h^{(g)}_1\sum_{n=1}^{N}\boldsymbol{\Phi}^{(g)\top}\mathbf{e}_n\mathbf{1}_N^\top\,\mathrm{tr}\big(\mathbf{e}_n\mathbf{e}_n^\top\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) - \sum_{g=1}^{G} h^{(g)}_1\,\boldsymbol{\Phi}^{(g)\top}\mathbf{X}^{(g)}\mathbf{X}^{(g)\top} \\ &\quad + \sum_{g=1}^{G} h^{(g)}_2\sum_{n_1=1}^{N}\sum_{n_2=1}^{N}\boldsymbol{\Phi}^{(g)\top}\mathbf{1}_N\mathbf{e}_{n_2}^\top\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{1}_N\mathbf{e}_{n_1}^\top\,\mathrm{tr}\big(\mathbf{e}_{n_1}\mathbf{e}_{n_1}^\top\mathbf{e}_{n_2}\mathbf{e}_{n_2}^\top\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) \\ &\quad + \sum_{g=1}^{G} h^{(g)}_2\sum_{n_1=1}^{N}\sum_{n_2=1}^{N}\boldsymbol{\Phi}^{(g)\top}\mathbf{e}_{n_2}\mathbf{1}_N^\top\boldsymbol{\Phi}^{(g)}\mathbf{W}\mathbf{e}_{n_1}\mathbf{1}_N^\top\,\mathrm{tr}\big(\mathbf{e}_{n_1}\mathbf{e}_{n_1}^\top\mathbf{e}_{n_2}\mathbf{e}_{n_2}^\top\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) \\ &\quad - 2\sum_{g=1}^{G} h^{(g)}_2\sum_{n=1}^{N}\boldsymbol{\Phi}^{(g)\top}\mathbf{1}_N\mathbf{e}_n^\top\,\mathrm{tr}\big(\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\mathbf{e}_n\mathbf{e}_n^\top\boldsymbol{\Phi}^{(g)}\mathbf{W}\big) \\ &\quad - 2\sum_{g=1}^{G} h^{(g)}_2\sum_{n=1}^{N}\mathrm{tr}\big(\mathbf{W}^\top\boldsymbol{\Phi}^{(g)\top}\mathbf{1}_N\mathbf{e}_n^\top\big)\,\boldsymbol{\Phi}^{(g)\top}\mathbf{e}_n\mathbf{e}_n^\top\mathbf{X}^{(g)}\mathbf{X}^{(g)\top} \\ &\quad + 2\sum_{g=1}^{G} h^{(g)}_2\,\big[\boldsymbol{\Phi}^{(g)\top}\boldsymbol{\Phi}^{(g)}\big]\mathbf{W}\big[\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big]. \qquad (12) \end{aligned}$$

And finally,

$$\frac{\partial J_3(\mathbf{W})}{\partial\mathbf{W}} = 2\mathbf{W}. \qquad (13)$$

Then, from (11), (12), and (13), we have that

$$\begin{aligned} \frac{\partial J(\mathbf{W})}{\partial\mathbf{W}} &= -2\sum_{g}\boldsymbol{\Phi}^{(g)\top}\mathbf{A}^{(g)} + 2\sum_{g}\big[\boldsymbol{\Phi}^{(g)\top}\boldsymbol{\Phi}^{(g)}\big]\mathbf{W} \\ &\quad + \alpha\sum_{g=1}^{G} h^{(g)}_1\sum_{n=1}^{N}\boldsymbol{\Phi}^{(g)\top}\mathbf{e}_n\mathbf{1}_N^\top\,\mathrm{tr}\big(\mathbf{e}_n\mathbf{e}_n^\top\mathbf{X}^{(g)}\mathbf{X}^{(g)\top}\big) - \alpha\sum_{g=1}^{G} h^{(g)}_1\,\boldsymbol{\Phi}^{(g)\top}\mathbf{X}^{(g)}\mathbf{X}^{(g)\top} + \alpha\,\ldots \end{aligned}$$
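To make the closed-form flavour of Proposition 1 concrete, the following sketch solves the $\alpha = 0$ special case of cost (6), which reduces to ridge regression over all edge pairs (a minimal sketch under that assumption; `solve_w_ridge` is a hypothetical helper, not the authors' full solver, which also accounts for the graph-spectral regularizer):

```python
import numpy as np

def solve_w_ridge(pairs, beta):
    """Closed-form minimizer of sum (a_ij - w^T f_ij)^2 + beta * ||w||^2.

    pairs : list of (f_ij, a_ij) with f_ij an M-dimensional feature vector
            and a_ij the target edge weight. Covers only the alpha = 0
            special case of cost (6); the spectral regularizer is omitted.
    """
    F = np.array([f for f, _ in pairs])   # (P, M) stacked features
    a = np.array([y for _, y in pairs])   # (P,) stacked edge weights
    M = F.shape[1]
    # Normal equations of the ridge problem: (F^T F + beta I) w = F^T a
    return np.linalg.solve(F.T @ F + beta * np.eye(M), F.T @ a)
```

With enough edge pairs and a small $\beta$, the recovered coefficients approach the generating weights, which is the behaviour the full convex problem generalizes.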