# Kernel-based Reconstruction of Graph Signals

A number of applications in engineering, social sciences, physics, and biology involve inference over networks. In this context, graph signals are widely encountered as descriptors of vertex attributes or features in graph-structured data. Estimating such signals in all vertices given noisy observations of their values on a subset of vertices has been extensively analyzed in the literature of signal processing on graphs (SPoG). This paper advocates kernel regression as a framework generalizing popular SPoG modeling and reconstruction and expanding their capabilities. Formulating signal reconstruction as a regression task on reproducing kernel Hilbert spaces of graph signals brings in benefits from statistical learning, offers fresh insights, and allows for estimators to leverage richer forms of prior information than existing alternatives. A number of SPoG notions such as bandlimitedness, graph filters, and the graph Fourier transform are naturally accommodated in the kernel framework. Additionally, this paper capitalizes on the so-called representer theorem to devise simpler versions of existing Tikhonov regularized estimators, and offers a novel probabilistic interpretation of kernel methods on graphs based on graphical models. Motivated by the challenges of selecting the bandwidth parameter in SPoG estimators or the kernel map in kernel-based methods, the present paper further proposes two multi-kernel approaches with complementary strengths. Whereas the first enables estimation of the unknown bandwidth of bandlimited signals, the second allows for efficient graph filter selection. Numerical tests with synthetic as well as real data demonstrate the merits of the proposed methods relative to state-of-the-art alternatives.

## I Introduction

Graph data play a central role in analysis and inference tasks for social, brain, communication, biological, transportation, and sensor networks [1], thanks to their ability to capture relational information. Vertex attributes or features associated with vertices can be interpreted as functions or signals defined on graphs. In social networks for instance, where a vertex represents a person and an edge corresponds to a friendship relation, such a function may denote e.g. the person’s age, location, or rating of a given movie.

Research efforts over the last years are centered on estimating or processing functions on graphs; see e.g. [2, 3, 1, 4, 5, 6]. Existing approaches rely on the premise that signals obey a certain form of parsimony relative to the graph topology. For instance, it seems reasonable to estimate a person’s age by looking at their friends’ age. The present paper deals with a general version of this task, where the goal is to estimate a graph signal given noisy observations on a subset of vertices.

The machine learning community has already looked at SPoG-related issues in the context of semi-supervised learning under the term of transductive regression and classification [6, 7, 8]. Existing approaches rely on smoothness assumptions for inference of processes over graphs using nonparametric methods [6, 2, 3, 9]. Whereas some works consider estimation of real-valued signals [7, 9, 8, 10], most in this body of literature have focused on estimating binary-valued functions; see e.g. [6]. On the other hand, function estimation has also been investigated recently by the community of signal processing on graphs (SPoG) under the term signal reconstruction [11, 12, 13, 14, 15, 16, 17, 18]. Existing approaches commonly adopt parametric estimation tools and rely on bandlimitedness, by which the signal of interest is assumed to lie in the span of the $B$ leading eigenvectors of the graph Laplacian or the adjacency matrix [19, 14, 16, 12, 13, 17, 18]. Different from machine learning works, SPoG research is mainly concerned with estimating real-valued functions.

The present paper cross-pollinates ideas and broadens both machine learning and SPoG perspectives under the unifying framework of kernel-based learning. The first part unveils the implications of adopting this standpoint and demonstrates how it naturally accommodates a number of SPoG concepts and tools. From a high level, this connection (i) brings to bear performance bounds and algorithms from transductive regression [8] and the extensively analyzed general kernel methods (see e.g. [20]); (ii) offers the possibility of reducing the dimension of the optimization problems involved in Tikhonov regularized estimators by invoking the so-called representer theorem [21]; and (iii) provides guidelines for systematically selecting parameters in existing signal reconstruction approaches by leveraging the connection with linear minimum mean-square error (LMMSE) estimation via covariance kernels.

Further implications of applying kernel methods to graph signal reconstruction are also explored. Specifically, it is shown that the finite dimension of graph signal spaces allows for an insightful proof of the representer theorem which, different from existing proofs relying on functional analysis, solely involves linear algebra arguments. Moreover, an intuitive probabilistic interpretation of graph kernel methods is introduced based on graphical models. These findings are complemented with a technique to deploy regression with Laplacian kernels in big-data setups.

It is further established that a number of existing signal reconstruction approaches, including the least-squares (LS) estimators for bandlimited signals from [12, 13, 14, 15, 16, 11]; the Tikhonov regularized estimators from [12, 22, 4] and [23, eq. (27)]; and the maximum a posteriori estimator in [13], can be viewed as kernel methods on reproducing kernel Hilbert spaces (RKHSs) of graph signals. Popular notions in SPoG such as graph filters, the graph Fourier transform, and bandlimited signals can also be accommodated under the kernel framework. First, it is seen that a graph filter [4] is essentially a kernel smoother [24]. Second, bandlimited kernels are introduced to accommodate estimation of bandlimited signals. Third, the connection between the so-called graph Fourier transform [4] (see [15, 5] for a related definition) and Laplacian kernels [2, 3] is delineated. Relative to methods relying on the bandlimited property (see e.g. [12, 13, 14, 15, 16, 11, 17]), kernel methods offer increased flexibility in leveraging prior information about the graph Fourier transform of the estimated signal.

The second part of the paper pertains to the challenge of model selection. On the one hand, a number of reconstruction schemes in SPoG [12, 13, 14, 15, 17] require knowledge of the signal bandwidth, which is typically unknown [11, 16]. Existing approaches for determining this bandwidth rely solely on the set of sampled vertices, disregarding the observations [11, 16]. On the other hand, existing kernel-based approaches [1, Ch. 8] necessitate proper kernel selection, which is computationally inefficient through cross-validation.

The present paper addresses both issues by means of two multi-kernel learning (MKL) techniques having complementary strengths. Note that existing MKL methods on graphs are confined to estimating binary-valued signals [25, 26, 27]. This paper, on the other hand, is concerned with MKL algorithms for real-valued graph signal reconstruction. The novel graph MKL algorithms optimally combine the kernels in a given dictionary and simultaneously estimate the graph signal by solving a single optimization problem.

The rest of the paper is structured as follows. Sec. II formulates the problem of graph signal reconstruction. Sec. III presents kernel-based learning as an encompassing framework for graph signal reconstruction, and explores the implications of adopting such a standpoint. Two MKL algorithms are then presented in Sec. IV. Sec. V complements analytical findings with numerical tests by comparing with competing alternatives via synthetic- and real-data experiments. Finally, concluding remarks are highlighted in Sec. VI.

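Notation. $(\cdot)_N$ denotes the remainder of integer division by $N$; $\delta[\cdot]$ the Kronecker delta, and $\mathcal{I}[C]$ the indicator of condition $C$, returning 1 if $C$ is satisfied and 0 otherwise. Scalars are denoted by lowercase letters, vectors by bold lowercase, and matrices by bold uppercase. The $(i,j)$th entry of matrix $\mathbf{A}$ is $(\mathbf{A})_{i,j}$. Notation $\|\cdot\|_2$ and $\mathrm{Tr}(\cdot)$ respectively represent Euclidean norm and trace; $\mathbf{I}_N$ denotes the $N \times N$ identity matrix; $\mathbf{e}_i$ is the $i$-th canonical vector of $\mathbb{R}^M$, while $\mathbf{0}$ ($\mathbf{1}$) is a vector of appropriate dimension with all zeros (ones). The span of the columns of $\mathbf{A}$ is denoted by $\mathcal{R}(\mathbf{A})$, whereas $\mathbf{A} \succ \mathbf{B}$ (resp. $\mathbf{A} \succeq \mathbf{B}$) means that $\mathbf{A} - \mathbf{B}$ is positive definite (resp. semi-definite). Superscripts $T$ and $\dagger$ respectively stand for transposition and pseudo-inverse, whereas $\mathbb{E}$ denotes expectation.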

## II Problem Statement

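A graph is a tuple $\mathcal{G} := (\mathcal{V}, w)$, where $\mathcal{V} := \{v_1, \ldots, v_N\}$ is the vertex set, and $w: \mathcal{V} \times \mathcal{V} \to [0, +\infty)$ is a map assigning a weight to each vertex pair. For simplicity, it is assumed that $w(v, v) = 0$ for all $v \in \mathcal{V}$. This paper focuses on undirected graphs, for which $w(v, v') = w(v', v)$ for all $v, v' \in \mathcal{V}$. A graph is said to be unweighted if $w(v, v')$ is either 0 or 1. The edge set $\mathcal{E}$ is the support of $w$, i.e., $\mathcal{E} := \{(v, v') \in \mathcal{V} \times \mathcal{V} : w(v, v') \neq 0\}$. Two vertices $v$ and $v'$ are adjacent, connected, or neighbors if $(v, v') \in \mathcal{E}$. The $n$-th neighborhood $\mathcal{N}_n$ is the set of neighbors of $v_n$, i.e., $\mathcal{N}_n := \{v \in \mathcal{V} : (v, v_n) \in \mathcal{E}\}$. The information in $w$ is compactly represented by the $N \times N$ weighted adjacency matrix $\mathbf{W}$, whose $(n, n')$-th entry is $w(v_n, v_{n'})$; the $N \times N$ diagonal degree matrix $\mathbf{D}$, whose $(n, n)$-th entry is $\sum_{n'=1}^{N} w(v_n, v_{n'})$; and the Laplacian matrix $\mathbf{L} := \mathbf{D} - \mathbf{W}$, which is symmetric and positive semidefinite [1, Ch. 2]. The latter is sometimes replaced with its normalized version $\mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$, whose eigenvalues are confined to the interval $[0, 2]$.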

A real-valued function (or signal) on a graph is a map $f_0: \mathcal{V} \to \mathbb{R}$. As mentioned in Sec. I, the value $f_0(v)$ represents an attribute or feature of $v \in \mathcal{V}$, such as age, political alignment, or annual income of a person in a social network. Signal $f_0$ is thus represented by $\mathbf{f}_0 := [f_0(v_1), \ldots, f_0(v_N)]^T$.

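Suppose that a collection of noisy samples (or observations) $y_s = f_0(v_{n_s}) + e_s$, $s = 1, \ldots, S$, is available, where $e_s$ models noise and $\mathcal{S} := \{n_1, \ldots, n_S\}$ contains the indices $1 \le n_1 < \cdots < n_S \le N$ of the sampled vertices. In a social network, this may be the case if a subset of persons have been surveyed about the attribute of interest (e.g. political alignment). Given $\{(n_s, y_s)\}_{s=1}^{S}$, and assuming knowledge of $\mathcal{G}$, the goal is to estimate $f_0$. This will provide estimates of $f_0(v)$ both at observed and unobserved vertices $v \in \mathcal{V}$. By defining $\mathbf{y} := [y_1, \ldots, y_S]^T$, the observation model is summarized as

$$\mathbf{y} = \boldsymbol{\Phi} \mathbf{f}_0 + \mathbf{e} \tag{1}$$

where $\mathbf{e} := [e_1, \ldots, e_S]^T$ and $\boldsymbol{\Phi}$ is an $S \times N$ matrix with entries $(s, n_s)$, $s = 1, \ldots, S$, set to one, and the rest set to zero.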
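The observation model (1) can be illustrated with a minimal NumPy sketch. The path graph, the signal values, the noise level, and the helper name `sampling_matrix` are assumptions chosen for illustration only; indices are 0-based in the code, whereas the text uses 1-based indices.

```python
import numpy as np

def sampling_matrix(sampled, n_vertices):
    """S x N matrix Phi with one unit entry per row, at column n_s (0-based)."""
    phi = np.zeros((len(sampled), n_vertices))
    phi[np.arange(len(sampled)), sampled] = 1.0
    return phi

rng = np.random.default_rng(0)
N = 6

# Hypothetical undirected, unweighted path graph v_1 - v_2 - ... - v_6
W = np.zeros((N, N))
for n in range(N - 1):
    W[n, n + 1] = W[n + 1, n] = 1.0
D = np.diag(W.sum(axis=1))     # diagonal degree matrix
L = D - W                      # Laplacian matrix

f0 = np.linspace(0.0, 1.0, N)  # illustrative graph signal
sampled = np.array([0, 2, 5])  # indices of the sampled vertices
Phi = sampling_matrix(sampled, N)
e = 0.01 * rng.standard_normal(len(sampled))
y = Phi @ f0 + e               # observation model (1)
```

Multiplying by `Phi` simply selects the entries of the signal at the sampled vertices, so in practice one may index the signal directly instead of forming the matrix.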

## III Unifying the reconstruction of graph signals

Kernel methods constitute the “workhorse” of statistical learning for nonlinear function estimation [20]. Their popularity can be ascribed to their simplicity, flexibility, and good performance. This section presents kernel regression as a novel unifying framework for graph signal reconstruction.

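Kernel regression seeks an estimate of $f_0$ in an RKHS $\mathcal{H}$, which is the space of functions $f: \mathcal{V} \to \mathbb{R}$ defined as

$$\mathcal{H} := \Big\{ f : f(v) = \sum_{n=1}^{N} \bar{\alpha}_n \kappa(v, v_n),~ \bar{\alpha}_n \in \mathbb{R} \Big\}. \tag{2}$$

The kernel map $\kappa: \mathcal{V} \times \mathcal{V} \to \mathbb{R}$ is any function defining a symmetric and positive semidefinite $N \times N$ matrix $\bar{\mathbf{K}}$ with entries $(\bar{\mathbf{K}})_{n,n'} := \kappa(v_n, v_{n'})$ [28]. Intuitively, $\kappa(v, v')$ is a basis function in (2) measuring similarity between the values of $f_0$ at $v$ and $v'$. For instance, if a feature vector $\mathbf{x}_n \in \mathbb{R}^D$ containing attributes of the entity represented by $v_n$ is known for $n = 1, \ldots, N$, one can employ the popular Gaussian kernel $\kappa(v_n, v_{n'}) = \exp\{-\|\mathbf{x}_n - \mathbf{x}_{n'}\|^2 / \sigma^2\}$, where $\sigma^2 > 0$ is a user-selected parameter [20]. When such feature vectors are not available, the graph topology can be leveraged to construct graph kernels as detailed in Sec. III-B.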

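Different from RKHSs of functions $f(\mathbf{x})$ defined over infinite sets, the expansion in (2) is finite since $\mathcal{V}$ is finite. This implies that RKHSs of graph signals are finite-dimensional spaces. From (2), it follows that any signal in $\mathcal{H}$ can be expressed as:

$$\mathbf{f} := [f(v_1), \ldots, f(v_N)]^T = \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}} \tag{3}$$

for some $N \times 1$ vector $\bar{\boldsymbol{\alpha}} := [\bar{\alpha}_1, \ldots, \bar{\alpha}_N]^T$. Given two functions $f(v) := \sum_{n=1}^{N} \bar{\alpha}_n \kappa(v, v_n)$ and $f'(v) := \sum_{n=1}^{N} \bar{\alpha}'_n \kappa(v, v_n)$, their RKHS inner product is defined as

$$\langle f, f' \rangle_{\mathcal{H}} := \sum_{n=1}^{N} \sum_{n'=1}^{N} \bar{\alpha}_n \bar{\alpha}'_{n'} \kappa(v_n, v_{n'}) = \bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}}' \tag{4}$$

where $\bar{\boldsymbol{\alpha}}' := [\bar{\alpha}'_1, \ldots, \bar{\alpha}'_N]^T$. (Whereas $f$ denotes a function, symbol $f(v)$ represents the scalar resulting from evaluating $f$ at vertex $v$.) The RKHS norm is defined by

$$\|f\|_{\mathcal{H}}^2 := \langle f, f \rangle_{\mathcal{H}} = \bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}} \tag{5}$$

and will be used as a regularizer to control overfitting. As a special case, setting $\bar{\mathbf{K}} = \mathbf{I}_N$ recovers the standard inner product $\langle f, f' \rangle_{\mathcal{H}} = \mathbf{f}^T \mathbf{f}'$, and Euclidean norm $\|f\|_{\mathcal{H}}^2 = \|\mathbf{f}\|_2^2$. Note that when $\bar{\mathbf{K}} \succ \mathbf{0}$, the set of functions of the form (3) equals $\mathbb{R}^N$. Thus, two RKHSs with strictly positive definite kernel matrices contain the same functions. They differ only in their RKHS inner products and norms. Interestingly, this observation establishes that any positive definite kernel is universal [29] for graph signal reconstruction.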

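The term reproducing kernel stems from the reproducing property. Let $\kappa(\cdot, v_{n_0})$ denote the map $v \mapsto \kappa(v, v_{n_0})$, where $n_0 \in \{1, \ldots, N\}$. Using (4), the reproducing property can be expressed as $\langle \kappa(\cdot, v_{n_0}), \kappa(\cdot, v_{n_0'}) \rangle_{\mathcal{H}} = \mathbf{e}_{n_0}^T \bar{\mathbf{K}} \mathbf{e}_{n_0'} = \kappa(v_{n_0}, v_{n_0'})$. Due to the linearity of inner products and the fact that all signals in $\mathcal{H}$ are the superposition of functions of the form $\kappa(\cdot, v_n)$, the reproducing property asserts that inner products can be obtained just by evaluating $\kappa$. The reproducing property is of paramount importance when dealing with an RKHS of functions defined on infinite spaces (thus excluding RKHSs of graph signals), since it offers an efficient alternative to the costly multidimensional integration required by inner products such as $\langle f_1, f_2 \rangle_{L^2} = \int_{\mathcal{X}} f_1(\mathbf{x}) f_2(\mathbf{x}) \, d\mathbf{x}$.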

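Given $\{y_s\}_{s=1}^{S}$, RKHS-based function estimators are obtained by solving functional minimization problems formulated as

$$\hat{f}_0 := \arg\min_{f \in \mathcal{H}} \frac{1}{S} \sum_{s=1}^{S} \mathcal{L}(v_{n_s}, y_s, f(v_{n_s})) + \mu \Omega(\|f\|_{\mathcal{H}}) \tag{6}$$

where the regularization parameter $\mu > 0$ controls overfitting, the increasing function $\Omega$ is used to promote smoothness, and the loss function $\mathcal{L}$ measures how estimates deviate from the data. The so-called square loss $\mathcal{L}(v_{n_s}, y_s, f(v_{n_s})) := [y_s - f(v_{n_s})]^2$ constitutes a popular choice for $\mathcal{L}$, whereas $\Omega$ is often set to $\Omega(\zeta) = |\zeta|$ or $\Omega(\zeta) = \zeta^2$.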

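To simplify notation, consider loss functions expressible as $\mathcal{L}(v_{n_s}, y_s, f(v_{n_s})) = \mathcal{L}(y_s - f(v_{n_s}))$; extensions to more general cases are straightforward. The vector-version of such a function is $\mathcal{L}(\mathbf{y} - \boldsymbol{\Phi} \mathbf{f}) := \sum_{s=1}^{S} \mathcal{L}(y_s - f(v_{n_s}))$. Substituting (3) and (5) into (6) shows that $\hat{\mathbf{f}}_0$ can be obtained as $\hat{\mathbf{f}}_0 = \bar{\mathbf{K}} \hat{\bar{\boldsymbol{\alpha}}}$, where

$$\hat{\bar{\boldsymbol{\alpha}}} := \arg\min_{\bar{\boldsymbol{\alpha}} \in \mathbb{R}^N} \frac{1}{S} \mathcal{L}(\mathbf{y} - \boldsymbol{\Phi} \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}}) + \mu \Omega\big((\bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}})^{1/2}\big). \tag{7}$$

An alternative form of (7) that will be frequently used in the sequel results upon noting that $\bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}} = \bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\mathbf{K}}^{\dagger} \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}} = \mathbf{f}^T \bar{\mathbf{K}}^{\dagger} \mathbf{f}$. Thus, one can rewrite (7) as

$$\hat{\mathbf{f}}_0 := \arg\min_{\mathbf{f} \in \mathcal{R}(\bar{\mathbf{K}})} \frac{1}{S} \mathcal{L}(\mathbf{y} - \boldsymbol{\Phi} \mathbf{f}) + \mu \Omega\big((\mathbf{f}^T \bar{\mathbf{K}}^{\dagger} \mathbf{f})^{1/2}\big). \tag{8}$$

If $\bar{\mathbf{K}} \succ \mathbf{0}$, the constraint $\mathbf{f} \in \mathcal{R}(\bar{\mathbf{K}})$ can be omitted, and $\bar{\mathbf{K}}^{\dagger}$ can be replaced with $\bar{\mathbf{K}}^{-1}$. If $\bar{\mathbf{K}}$ contains null eigenvalues, it is customary to remove the constraint by replacing $\bar{\mathbf{K}}$ (or $\bar{\mathbf{K}}^{\dagger}$) with a perturbed version $\bar{\mathbf{K}} + \epsilon \mathbf{I}_N$ (respectively $\bar{\mathbf{K}}^{\dagger} + \epsilon \mathbf{I}_N$), where $\epsilon > 0$ is a small constant. Expression (8) shows that kernel regression unifies and subsumes the Tikhonov-regularized graph signal reconstruction schemes in [12, 22, 4] and [23, eq. (27)] by properly selecting $\bar{\mathbf{K}}$, $\mathcal{L}$, and $\Omega$ (see Sec. III-B).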

### III-A Representer theorem

Although graph signals can be reconstructed from (7), such an approach involves optimizing over $N$ variables. This section shows that a solution can be obtained by solving an optimization problem in $S$ variables, where typically $S \ll N$.

The representer theorem [21, 28] plays an instrumental role in the non-graph setting of infinite-dimensional $\mathcal{H}$, where (6) cannot be directly solved. This theorem enables a solver by providing a finite parameterization of the function $\hat{f}_0$ in (6). On the other hand, when $\mathcal{H}$ comprises graph signals, (6) is inherently finite-dimensional and can be solved directly. However, the representer theorem can still be beneficial to reduce the dimension of the optimization in (7).

###### Theorem 1 (Representer theorem).

The solution to the functional minimization in (6) can be expressed as

$$\hat{f}_0(v) = \sum_{s=1}^{S} \alpha_s \kappa(v, v_{n_s}) \tag{9}$$

for some $\alpha_s \in \mathbb{R}$, $s = 1, \ldots, S$.

The conventional proof for the representer theorem involves tools from functional analysis [28]. However, when $\mathcal{H}$ comprises functions defined on finite spaces, such as graph signals, an insightful proof can be obtained relying solely on linear algebra arguments (see Appendix A).

Since the solution $\hat{f}_0$ of (6) lies in $\mathcal{H}$, it can be expressed as $\hat{f}_0(v) = \sum_{n=1}^{N} \bar{\alpha}_n \kappa(v, v_n)$ for some $\{\bar{\alpha}_n\}_{n=1}^{N}$. Theorem 1 states that the terms corresponding to unobserved vertices $v_n$, $n \notin \mathcal{S}$, play no role in the kernel expansion of the estimate; that is, $\bar{\alpha}_n = 0$ for all $n \notin \mathcal{S}$. Thus, whereas (7) requires optimization over $N$ variables, Theorem 1 establishes that a solution can be found by solving a problem in $S$ variables, where typically $S \ll N$. Clearly, this conclusion carries over to the signal reconstruction schemes in [12, 22, 4] and [23, eq. (27)], since they constitute special instances of kernel regression. The fact that the number of parameters to be estimated after applying Theorem 1 depends on (in fact, equals) the number of samples $S$ justifies why $\hat{f}_0$ in (6) is referred to as a nonparametric estimate.

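Theorem 1 shows the form of $\hat{f}_0$ but does not provide the optimal $\{\alpha_s\}_{s=1}^{S}$, which is found after substituting (9) into (6) and solving the resulting optimization problem with respect to these coefficients. To this end, let $\boldsymbol{\alpha} := [\alpha_1, \ldots, \alpha_S]^T$, and write $\bar{\boldsymbol{\alpha}} = \boldsymbol{\Phi}^T \boldsymbol{\alpha}$ to deduce that

$$\hat{\mathbf{f}}_0 = \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}} = \bar{\mathbf{K}} \boldsymbol{\Phi}^T \boldsymbol{\alpha}. \tag{10}$$

From (7) and (10), the optimal $\boldsymbol{\alpha}$ can be found as

$$\hat{\boldsymbol{\alpha}} := \arg\min_{\boldsymbol{\alpha} \in \mathbb{R}^S} \frac{1}{S} \mathcal{L}(\mathbf{y} - \mathbf{K} \boldsymbol{\alpha}) + \mu \Omega\big((\boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha})^{1/2}\big) \tag{11}$$

where $\mathbf{K} := \boldsymbol{\Phi} \bar{\mathbf{K}} \boldsymbol{\Phi}^T$.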

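Example 1 (kernel ridge regression). For $\mathcal{L}$ chosen as the square loss and $\Omega(\zeta) = \zeta^2$, the $\hat{f}_0$ in (6) is referred to as the kernel ridge regression estimate. It is given by $\hat{\mathbf{f}}_{\mathrm{RR}} = \bar{\mathbf{K}} \boldsymbol{\Phi}^T \hat{\boldsymbol{\alpha}}$, where

$$\hat{\boldsymbol{\alpha}} := \arg\min_{\boldsymbol{\alpha} \in \mathbb{R}^S} \frac{1}{S} \|\mathbf{y} - \mathbf{K} \boldsymbol{\alpha}\|^2 + \mu \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} \tag{12a}$$

$$\phantom{\hat{\boldsymbol{\alpha}} :}= (\mathbf{K} + \mu S \mathbf{I}_S)^{-1} \mathbf{y}. \tag{12b}$$

Therefore, $\hat{\mathbf{f}}_{\mathrm{RR}}$ can be expressed as

$$\hat{\mathbf{f}}_{\mathrm{RR}} = \bar{\mathbf{K}} \boldsymbol{\Phi}^T (\mathbf{K} + \mu S \mathbf{I}_S)^{-1} \mathbf{y}. \tag{13}$$

As seen in the next section, (13) generalizes a number of existing signal reconstructors upon properly selecting $\bar{\mathbf{K}}$. Thus, Theorem 1 can also be used to simplify Tikhonov-regularized estimators such as the one in [12, eq. (15)]. To see this, just note that (13) inverts an $S \times S$ matrix whereas [12, eq. (16)] entails the inversion of an $N \times N$ matrix.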
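The dimensionality reduction afforded by the representer theorem can be checked numerically. The sketch below, under assumed illustrative values (a random positive definite kernel matrix, three sampled vertices, $\mu = 0.1$), evaluates the kernel ridge regression estimate (13) by solving an $S \times S$ system, and compares it with the solution of the $N$-variable problem (8) for the square loss and $\Omega(\zeta) = \zeta^2$. The function name `kernel_ridge_regression` is hypothetical.

```python
import numpy as np

def kernel_ridge_regression(K_bar, sampled, y, mu):
    """Estimate (13): K_bar Phi^T (K + mu*S*I_S)^{-1} y, with K = Phi K_bar Phi^T."""
    S = len(sampled)
    K = K_bar[np.ix_(sampled, sampled)]            # Phi K_bar Phi^T (S x S)
    alpha = np.linalg.solve(K + mu * S * np.eye(S), y)
    return K_bar[:, sampled] @ alpha               # K_bar Phi^T alpha

rng = np.random.default_rng(1)
N, mu = 8, 0.1
A = rng.standard_normal((N, N))
K_bar = A @ A.T + np.eye(N)                        # arbitrary positive definite kernel matrix
sampled = np.array([1, 4, 6])                      # 0-based sampled vertex indices
S = len(sampled)
y = rng.standard_normal(S)

f_hat = kernel_ridge_regression(K_bar, sampled, y, mu)

# Reference: N-variable form (8) with K_bar positive definite. Its normal
# equations are (Phi^T Phi / S + mu K_bar^{-1}) f = Phi^T y / S.
Phi = np.zeros((S, N))
Phi[np.arange(S), sampled] = 1.0
f_ref = np.linalg.solve(Phi.T @ Phi / S + mu * np.linalg.inv(K_bar), Phi.T @ y / S)

print(np.max(np.abs(f_hat - f_ref)))               # agreement up to round-off
```

Both routes return the same estimate, but the first only requires the rows and columns of the kernel matrix associated with sampled vertices plus one $S \times S$ solve.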

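Example 2 (support vector regression). If $\mathcal{L}$ equals the so-called $\epsilon$-insensitive loss $\mathcal{L}(y_s - f(v_{n_s})) := \max(0, |y_s - f(v_{n_s})| - \epsilon)$ and $\Omega(\zeta) = \zeta^2$, then (6) constitutes a support vector machine for regression (see e.g. [20, Ch. 1]).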

### III-B Graph kernels for signal reconstruction

When estimating functions on graphs, conventional kernels such as the aforementioned Gaussian kernel cannot be applied because the underlying set where graph signals are defined is not a metric space. Indeed, no vertex addition $v_n + v_{n'}$, scaling $\beta v_n$, or norm $\|v_n\|$ can be naturally defined on $\mathcal{V}$. An alternative is to embed $\mathcal{V}$ into a Euclidean space via a feature map $\phi: \mathcal{V} \to \mathbb{R}^D$, and apply a conventional kernel afterwards. However, for a given graph it is generally unclear how to design such a map or select $D$, which motivates the adoption of graph kernels [3]. The rest of this section elaborates on three classes of graph kernels, namely Laplacian, bandlimited, and novel covariance kernels for reconstructing graph signals.

#### III-B1 Laplacian kernels

The term Laplacian kernel comprises a wide family of kernels obtained by applying a certain function to the Laplacian matrix $\mathbf{L}$. From a theoretical perspective, Laplacian kernels are well motivated since they constitute the graph counterpart of the so-called translation invariant kernels in Euclidean spaces [3]. This section reviews Laplacian kernels, provides novel insights in terms of interpolating signals, and highlights their versatility in capturing prior information about the graph Fourier transform of the estimated signal.

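Let $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_N$ denote the eigenvalues of the graph Laplacian matrix $\mathbf{L}$, and consider the eigendecomposition $\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T$, where $\boldsymbol{\Lambda} := \mathrm{diag}\{\lambda_1, \ldots, \lambda_N\}$. A Laplacian kernel is a kernel map $\kappa$ generating a matrix $\bar{\mathbf{K}}$ of the form

$$\bar{\mathbf{K}} := r^{\dagger}(\mathbf{L}) := \mathbf{U} r^{\dagger}(\boldsymbol{\Lambda}) \mathbf{U}^T \tag{14}$$

where $r(\boldsymbol{\Lambda})$ is the result of applying the user-selected non-negative map $r: \mathbb{R} \to \mathbb{R}_+$ to the diagonal entries of $\boldsymbol{\Lambda}$. For reasons that will become clear, the map $r(\lambda)$ is typically increasing in $\lambda$. Common choices include the diffusion kernel $r(\lambda) = \exp\{\sigma^2 \lambda / 2\}$ [2], and the $p$-step random walk kernel $r(\lambda) = (a - \lambda)^{-p}$, $a \ge 2$ [3]. Laplacian regularization [3, 30, 9, 31, 4] is effected by setting $r(\lambda) = 1 + \sigma^2 \lambda$ with $\sigma^2$ sufficiently large.

Observe that obtaining $\bar{\mathbf{K}}$ generally requires an eigendecomposition of $\mathbf{L}$, which is computationally challenging for large graphs (large $N$). Two techniques to reduce complexity in these big data scenarios are proposed in Appendix B.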
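A Laplacian kernel matrix as in (14) can be formed directly from the eigendecomposition when the graph is small. The following sketch assumes a hypothetical 6-vertex path graph and $\sigma^2 = 0.5$; the function name `laplacian_kernel` and the threshold used for the scalar pseudo-inverse are illustrative choices, not part of the original text.

```python
import numpy as np

def laplacian_kernel(L, r):
    """K_bar = U r^dagger(Lambda) U^T as in (14); r maps eigenvalues to non-negative reals."""
    lam, U = np.linalg.eigh(L)
    lam = np.clip(lam, 0.0, None)            # remove tiny negative round-off
    r_vals = np.asarray(r(lam), dtype=float)
    r_pinv = np.zeros_like(r_vals)
    nonzero = r_vals > 1e-12
    r_pinv[nonzero] = 1.0 / r_vals[nonzero]  # scalar pseudo-inverse of r(lambda_n)
    return (U * r_pinv) @ U.T                # scales the n-th eigenvector by r_pinv[n]

N, sigma2 = 6, 0.5
W = np.zeros((N, N))
for n in range(N - 1):                       # hypothetical path graph
    W[n, n + 1] = W[n + 1, n] = 1.0
L = np.diag(W.sum(axis=1)) - W

K_diffusion = laplacian_kernel(L, lambda lam: np.exp(sigma2 * lam / 2.0))
K_reg_lap = laplacian_kernel(L, lambda lam: 1.0 + sigma2 * lam)
```

For the regularized Laplacian choice, the result coincides with $(\mathbf{I}_N + \sigma^2 \mathbf{L})^{-1}$, which offers a quick sanity check that does not involve eigenvectors.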

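At this point, it is prudent to offer interpretations and insights into the principles behind the operation of Laplacian kernels. Towards this objective, note first that the regularizer from (8) is an increasing function of

$$\bar{\boldsymbol{\alpha}}^T \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}} = \mathbf{f}^T \bar{\mathbf{K}}^{\dagger} \mathbf{f} = \mathbf{f}^T \mathbf{U} r(\boldsymbol{\Lambda}) \mathbf{U}^T \mathbf{f} = \tilde{\mathbf{f}}^T r(\boldsymbol{\Lambda}) \tilde{\mathbf{f}} = \sum_{n=1}^{N} r(\lambda_n) |\tilde{f}_n|^2 \tag{15}$$

where $\tilde{\mathbf{f}} := \mathbf{U}^T \mathbf{f} := [\tilde{f}_1, \ldots, \tilde{f}_N]^T$ comprises the projections of $\mathbf{f}$ onto the eigenvectors of $\mathbf{L}$, and is referred to as the graph Fourier transform of $\mathbf{f}$ in the SPoG parlance [4]. Before interpreting (15), it is worth elucidating the rationale behind this term. Since $\mathbf{U} := [\mathbf{u}_1, \ldots, \mathbf{u}_N]$ is orthogonal, one can decompose $\mathbf{f}$ as

$$\mathbf{f} = \sum_{n=1}^{N} \tilde{f}_n \mathbf{u}_n. \tag{16}$$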

Because vectors $\{\mathbf{u}_n\}_{n=1}^{N}$, or more precisely their signal counterparts $\{u_n\}_{n=1}^{N}$, are eigensignals of the so-called graph shift operator $\mathbf{u} \mapsto \mathbf{L} \mathbf{u}$, (16) resembles the classical Fourier transform in the sense that it expresses a signal as a superposition of eigensignals of a Laplacian operator [4]. Recalling from Sec. II that $w(v_n, v_{n'})$ denotes the weight of the edge between $v_n$ and $v_{n'}$, one can consider the smoothness measure for graph functions $f$ given by

$$\partial f := \frac{1}{2} \sum_{n=1}^{N} \sum_{n' \in \mathcal{N}_n} w(v_n, v_{n'}) [f(v_n) - f(v_{n'})]^2 = \mathbf{f}^T \mathbf{L} \mathbf{f}$$

where the last equality follows from the definition of $\mathbf{L}$. Clearly, it holds $\partial u_n = \lambda_n$. Since $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_N$, it follows that $0 = \partial u_1 \le \ldots \le \partial u_N$. In analogy to signal processing for time signals, where lower frequencies correspond to smoother eigensignals, the index $n$, or alternatively the eigenvalue $\lambda_n$, is interpreted as the frequency of $u_n$.

It follows from (15) that the regularizer in (8) strongly penalizes those $\tilde{f}_n$ for which the corresponding $r(\lambda_n)$ is large, thus promoting a specific structure in this frequency domain. Specifically, one prefers $r(\lambda_n)$ to be large whenever $|\tilde{f}_n|^2$ is small and vice versa. The fact that $|\tilde{f}_n|^2$ is expected to decrease with $n$ for smooth $f$ motivates the adoption of an increasing $r$ [3]. Observe that Laplacian kernels can capture richer forms of prior information than the signal reconstructors of bandlimited signals in [12, 13, 14, 15, 17, 18], since the latter can solely capture the support of the Fourier transform whereas the former can also leverage magnitude information.
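The relation between the vertex-domain smoothness measure and the graph Fourier transform can be verified on a toy example. The sketch below assumes a hypothetical 5-vertex ring graph and an arbitrary signal; it evaluates the smoothness measure as a weighted sum of squared differences, as the quadratic form of the Laplacian, and as the frequency-domain sum in (15) with the illustrative choice $r(\lambda) = \lambda$.

```python
import numpy as np

N = 5
W = np.zeros((N, N))
for n in range(N):                        # hypothetical unweighted ring graph
    W[n, (n + 1) % N] = W[(n + 1) % N, n] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)                # eigenvalues in ascending order

f = np.array([1.0, 2.0, 0.5, -1.0, 3.0])  # arbitrary graph signal
f_tilde = U.T @ f                         # graph Fourier transform of f

# Vertex domain: 0.5 * sum_n sum_n' w(v_n, v_n') [f(v_n) - f(v_n')]^2
diffs = f[:, None] - f[None, :]
smooth_vertex = 0.5 * np.sum(W * diffs ** 2)

smooth_quadratic = f @ L @ f              # f^T L f
smooth_freq = np.sum(lam * f_tilde ** 2)  # sum_n lambda_n |f_tilde_n|^2
```

All three quantities agree, and the inverse transform `U @ f_tilde` recovers the signal as in (16).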

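Example 3 (circular graphs). This example capitalizes on Theorem 1 to present a novel SPoG-inspired intuitive interpretation of nonparametric regression with Laplacian kernels. To do so, a closed-form expression for the Laplacian kernel matrix of a circular graph (or ring) will be derived. This class of graphs has been commonly employed in the literature to illustrate connections between SPoG and signal processing of time-domain signals [5].

Up to vertex relabeling, an unweighted circular graph satisfies $w(v_n, v_{n'}) = \delta[(n - n')_N - 1] + \delta[(n' - n)_N - 1]$. Therefore, its Laplacian matrix can be written as $\mathbf{L} = 2\mathbf{I}_N - \mathbf{R} - \mathbf{R}^T$, where $\mathbf{R}$ is the rotation matrix resulting from circularly shifting the columns of $\mathbf{I}_N$ one position to the right, i.e., $(\mathbf{R})_{n,n'} := \delta[(n' - n)_N - 1]$. Matrix $\mathbf{L}$ is circulant since its $n$-th row can be obtained by circularly shifting the $(n-1)$-st row one position to the right. Hence, $\mathbf{L}$ can be diagonalized by the standard Fourier matrix [32], meaning

$$\mathbf{L} = \tilde{\mathbf{U}} \tilde{\boldsymbol{\Lambda}} \tilde{\mathbf{U}}^H \tag{17}$$

where $(\tilde{\mathbf{U}})_{m,m'} := (1/\sqrt{N}) \exp\{j 2\pi (m-1)(m'-1)/N\}$ is the unitary inverse discrete Fourier transform matrix and $(\tilde{\boldsymbol{\Lambda}})_{m,m'} := 2[1 - \cos(2\pi (m-1)/N)] \delta[m - m']$. Matrices $\tilde{\mathbf{U}}$ and $\tilde{\boldsymbol{\Lambda}}$ replace $\mathbf{U}$ and $\boldsymbol{\Lambda}$ since, for notational brevity, the eigendecomposition (17) involves complex-valued eigenvectors and the eigenvalues have not been sorted in ascending order.

From (14), a Laplacian kernel matrix is given by $\bar{\mathbf{K}} := \tilde{\mathbf{U}} r^{\dagger}(\tilde{\boldsymbol{\Lambda}}) \tilde{\mathbf{U}}^H := \tilde{\mathbf{U}} \mathrm{diag}\{\mathbf{d}\} \tilde{\mathbf{U}}^H$, where $\mathbf{d} := [d_0, \ldots, d_{N-1}]^T$ has entries $d_n = r^{\dagger}(2[1 - \cos(2\pi n/N)])$. It can be easily seen that $(\bar{\mathbf{K}})_{m,m'} = D_{m-m'}$, where

$$D_m := \mathrm{IDFT}\{d_n\} := \frac{1}{N} \sum_{n=0}^{N-1} d_n e^{j \frac{2\pi}{N} m n}.$$

If $r(2[1 - \cos(2\pi n/N)]) > 0$ for all $n$, one has that

$$D_m = \frac{1}{N} \sum_{n=0}^{N-1} \frac{e^{j \frac{2\pi}{N} m n}}{r(2[1 - \cos(2\pi n/N)])}. \tag{18}$$

Recall from (10) that Theorem 1 dictates $\hat{\mathbf{f}}_0 = \sum_{s=1}^{S} \alpha_s \bar{\boldsymbol{\kappa}}_{n_s}$, where $\bar{\mathbf{K}} := [\bar{\boldsymbol{\kappa}}_1, \ldots, \bar{\boldsymbol{\kappa}}_N]$. Since $(\bar{\mathbf{K}})_{m,m'} = D_{m-m'}$ and because $D_m$ is periodic in $m$ with period $N$, it follows that the vectors $\{\bar{\boldsymbol{\kappa}}_n\}_{n=1}^{N}$ are all circularly shifted versions of each other. Moreover, since $\bar{\mathbf{K}}$ is positive semidefinite, the largest entry of $\bar{\boldsymbol{\kappa}}_n$ is precisely the $n$-th one, which motivates interpreting $\bar{\boldsymbol{\kappa}}_n$ as an interpolating signal centered at $v_n$, which in turn suggests that the expression $\hat{\mathbf{f}}_0 = \sum_{s=1}^{S} \alpha_s \bar{\boldsymbol{\kappa}}_{n_s}$ can be thought of as a reconstruction equation. From this vantage point, signals $\{\bar{\boldsymbol{\kappa}}_{n_s}\}_{s=1}^{S}$ play an analogous role to sinc functions in signal processing of time-domain signals. Examples of these interpolating signals are depicted in Fig. 1.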
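The closed form (18) for the circular graph can be compared against the generic construction (14). The sketch below assumes $N = 8$ and a diffusion kernel with $\sigma^2 = 1$ (illustrative values), computes $D_m$ through an inverse FFT, and checks that the columns of the resulting kernel matrix are circular shifts of one another, each peaking at its own vertex.

```python
import numpy as np

N, sigma2 = 8, 1.0
n = np.arange(N)

# Ring Laplacian L = 2I - R - R^T, with R the circular shift of the identity
R = np.roll(np.eye(N), 1, axis=1)
L = 2 * np.eye(N) - R - R.T

def r(lam):
    """Diffusion kernel map r(lambda) = exp(sigma^2 lambda / 2), strictly positive."""
    return np.exp(sigma2 * lam / 2.0)

# Closed form (18): D_m is the inverse DFT of d_n = 1 / r(2[1 - cos(2 pi n / N)])
d = 1.0 / r(2 * (1 - np.cos(2 * np.pi * n / N)))
D = np.fft.ifft(d).real                       # imaginary part vanishes by symmetry of d
K_closed = D[(n[:, None] - n[None, :]) % N]   # (K_bar)_{m,m'} = D_{m-m'}

# Generic construction (14) from a real eigendecomposition
lam, U = np.linalg.eigh(L)
K_eig = (U / r(lam)) @ U.T

peaks = np.argmax(K_closed, axis=0)           # location of the largest entry per column
```

Here `peaks` equals `[0, 1, ..., N-1]`, consistent with interpreting each column as an interpolating signal centered at the corresponding vertex.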

#### III-B2 Bandlimited kernels

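A number of signal reconstruction approaches in the SPoG literature deal with graph bandlimited signals; see e.g. [12, 13, 14, 15, 17, 16, 18, 11]. Here, the notion of bandlimited kernel is introduced to formally show that the LS estimator for bandlimited signals [12, 13, 14, 15, 16, 11] is a limiting case of the kernel ridge regression estimate from (13). This notion will come in handy in Secs. IV and V to estimate the bandwidth of a bandlimited signal from the observations $\{y_s\}_{s=1}^{S}$.

Signal $f$ is said to be bandlimited if it admits an expansion (16) with $\tilde{f}_n$ supported on a set $\mathcal{B} \subset \{1, \ldots, N\}$; that is,

$$\mathbf{f} = \sum_{n \in \mathcal{B}} \tilde{f}_n \mathbf{u}_n = \mathbf{U}_{\mathcal{B}} \tilde{\mathbf{f}}_{\mathcal{B}} \tag{19}$$

where $\mathbf{U}_{\mathcal{B}}$ contains the columns of $\mathbf{U}$ with indexes in $\mathcal{B}$, and $\tilde{\mathbf{f}}_{\mathcal{B}}$ is a vector stacking $\{\tilde{f}_n\}_{n \in \mathcal{B}}$. The bandwidth of $f$ can be defined as the cardinality $B := |\mathcal{B}|$, or, as the greatest element of $\mathcal{B}$.

If $f_0$ is bandlimited, it follows from (1) that $\mathbf{y} = \boldsymbol{\Phi} \mathbf{f}_0 + \mathbf{e} = \boldsymbol{\Phi} \mathbf{U}_{\mathcal{B}} \tilde{\mathbf{f}}_{\mathcal{B}} + \mathbf{e}$ for some $\tilde{\mathbf{f}}_{\mathcal{B}}$. The LS estimate of $\mathbf{f}_0$ is therefore given by [12, 13, 14, 15, 16, 11]

$$\hat{\mathbf{f}}_{\mathrm{LS}} = \mathbf{U}_{\mathcal{B}} \arg\min_{\tilde{\mathbf{f}}_{\mathcal{B}} \in \mathbb{R}^{B}} \|\mathbf{y} - \boldsymbol{\Phi} \mathbf{U}_{\mathcal{B}} \tilde{\mathbf{f}}_{\mathcal{B}}\|^2 \tag{20a}$$

$$\phantom{\hat{\mathbf{f}}_{\mathrm{LS}}} = \mathbf{U}_{\mathcal{B}} [\mathbf{U}_{\mathcal{B}}^T \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{U}_{\mathcal{B}}]^{-1} \mathbf{U}_{\mathcal{B}}^T \boldsymbol{\Phi}^T \mathbf{y} \tag{20b}$$

where the second equality assumes that $\mathbf{U}_{\mathcal{B}}^T \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{U}_{\mathcal{B}}$ is invertible, a necessary and sufficient condition for the $B$ entries of $\tilde{\mathbf{f}}_{\mathcal{B}}$ to be identifiable.

The estimate in (20) can be accommodated in the kernel regression framework by properly constructing a bandlimited kernel. Intuitively, one can adopt a Laplacian kernel for which $r(\lambda_n)$ is large if $n \notin \mathcal{B}$ (cf. Sec. III-B1). Consider the Laplacian kernel $\bar{\mathbf{K}}_{\beta}$ with

$$r_{\beta}(\lambda_n) = \begin{cases} 1/\beta & n \in \mathcal{B} \\ \beta & n \notin \mathcal{B}. \end{cases} \tag{21}$$

For large $\beta$, this function strongly penalizes $\{\tilde{f}_n\}_{n \notin \mathcal{B}}$ (cf. (15)), which promotes bandlimited estimates. The values in (21) are chosen so that $\bar{\mathbf{K}}_{\beta}$ is non-singular, a property that simplifies the statement and the proofs of some of the results in this paper.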

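###### Proposition 1.

Let $\hat{\mathbf{f}}_{\mathrm{RR}}$ denote the kernel ridge regression estimate from (13) with kernel $\bar{\mathbf{K}}_{\beta}$ as in (21) and $\mu > 0$. If $\mathbf{U}_{\mathcal{B}}^T \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{U}_{\mathcal{B}}$ is invertible, as required by the estimator in (20b) for bandlimited signals, then $\hat{\mathbf{f}}_{\mathrm{RR}} \to \hat{\mathbf{f}}_{\mathrm{LS}}$ as $\beta \to \infty$.

###### Proof:

See Appendix C. ∎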

Proposition 1 shows that the framework of kernel-based regression subsumes LS estimation of bandlimited signals. A non-asymptotic counterpart of Proposition 1 can be found by setting for in (21), and noting that if . Note however that imposing renders a degenerate kernel-based estimate.
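The limiting behavior stated in Proposition 1 can be observed numerically. The sketch below is a sanity check under assumed settings (a random weighted graph with $N = 10$, bandwidth $B = 3$, six sampled vertices, $\mu = 10^{-2}$, and mild noise), not a substitute for the proof. It compares the kernel ridge regression estimate (13) under the bandlimited kernel (21) with the LS estimate (20b) for increasing $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, mu = 10, 3, 1e-2

# Hypothetical random undirected weighted graph
W = np.triu(rng.random((N, N)), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)
U_B = U[:, :B]                                   # band B = {1, ..., B}

sampled = np.array([0, 2, 3, 5, 7, 8])
S = len(sampled)
Phi = np.zeros((S, N))
Phi[np.arange(S), sampled] = 1.0

f0 = U_B @ rng.standard_normal(B)                # bandlimited signal as in (19)
y = Phi @ f0 + 0.05 * rng.standard_normal(S)     # observations (1)

# LS estimate (20b)
A = Phi @ U_B
f_ls = U_B @ np.linalg.solve(A.T @ A, A.T @ y)

def f_rr(beta):
    """Kernel ridge regression (13) with the bandlimited kernel (21)."""
    r = np.full(N, beta)
    r[:B] = 1.0 / beta
    K_bar = (U / r) @ U.T                        # r > 0, hence r^dagger = 1/r
    K = Phi @ K_bar @ Phi.T
    return K_bar @ Phi.T @ np.linalg.solve(K + mu * S * np.eye(S), y)

errs = [np.linalg.norm(f_rr(beta) - f_ls) for beta in (1e1, 1e3, 1e6)]
```

The entries of `errs` shrink as $\beta$ grows, in line with the convergence of the ridge estimate to the LS estimate.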

#### III-B3 Covariance kernels

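So far, signal $f_0$ has been assumed deterministic, which precludes accommodating certain forms of prior information that probabilistic models can capture, such as domain knowledge and historical data. A probabilistic interpretation of kernel methods on graphs will be pursued here to show that: (i) the optimal $\bar{\mathbf{K}}$ in the MSE sense for ridge regression is the covariance matrix of $\mathbf{f}_0$; and, (ii) kernel-based ridge regression seeks an estimate satisfying a system of local LMMSE estimation conditions on a Markov random field [33, Ch. 8].

Suppose without loss of generality that $\{f_0(v_n)\}_{n=1}^{N}$ are zero-mean random variables. The LMMSE estimator of $\mathbf{f}_0$ given $\mathbf{y}$ is the linear estimator $\hat{\mathbf{f}}_{\mathrm{LMMSE}}$ minimizing $\mathbb{E}\|\mathbf{f}_0 - \hat{\mathbf{f}}_{\mathrm{LMMSE}}\|_2^2$, where the expectation is over all $\mathbf{f}_0$ and noise realizations. With $\mathbf{C} := \mathbb{E}[\mathbf{f}_0 \mathbf{f}_0^T]$, the LMMSE estimate is given by

$$\hat{\mathbf{f}}_{\mathrm{LMMSE}} = \mathbf{C} \boldsymbol{\Phi}^T [\boldsymbol{\Phi} \mathbf{C} \boldsymbol{\Phi}^T + \sigma_e^2 \mathbf{I}_S]^{-1} \mathbf{y} \tag{22}$$

where $\sigma_e^2 := (1/S)\, \mathbb{E}[\|\mathbf{e}\|_2^2]$ denotes the noise variance. Comparing (22) with (13) and recalling that $\mathbf{K} := \boldsymbol{\Phi} \bar{\mathbf{K}} \boldsymbol{\Phi}^T$, it follows that $\hat{\mathbf{f}}_{\mathrm{LMMSE}} = \hat{\mathbf{f}}_{\mathrm{RR}}$ with $\mu S = \sigma_e^2$ and $\bar{\mathbf{K}} = \mathbf{C}$. In other words, the similarity measure $\kappa(v_n, v_{n'})$ embodied in the kernel map is just the covariance $\mathrm{cov}[f_0(v_n), f_0(v_{n'})]$. A related observation was pointed out in [34] for general kernel methods.

In short, one can interpret kernel ridge regression as the LMMSE estimator of a signal $\mathbf{f}_0$ with covariance matrix equal to $\bar{\mathbf{K}}$. This statement generalizes [13, Lemma 1], which requires $\mathbf{f}_0$ to be Gaussian, $\mathbf{C}$ rank-deficient, and $\sigma_e^2 = 0$.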

Recognizing that kernel ridge regression is a linear estimator readily establishes the following result.

###### Proposition 2.

If $\mathrm{MSE}(\bar{\mathbf{K}}, \mu) := \mathbb{E}[\|\mathbf{f}_0 - \hat{\mathbf{f}}_{\mathrm{RR}}(\bar{\mathbf{K}}, \mu)\|^2]$, where $\hat{\mathbf{f}}_{\mathrm{RR}}(\bar{\mathbf{K}}, \mu)$ denotes the estimator in (13), with kernel matrix $\bar{\mathbf{K}}$, and regularization parameter $\mu$, it then holds that

$$\mathrm{MSE}(\mathbf{C}, \sigma_e^2 / S) \le \mathrm{MSE}(\bar{\mathbf{K}}, \mu)$$

for all kernel matrices $\bar{\mathbf{K}}$ and $\mu > 0$.

Thus, for criteria aiming to minimize the MSE, Proposition 2 suggests that $\bar{\mathbf{K}}$ be chosen close to $\mathbf{C}$. This observation may be employed for kernel selection and for parameter tuning in graph signal reconstruction methods of the kernel ridge regression family (e.g. the Tikhonov regularized estimators from [12, 22, 4] and [23, eq. (27)]), whenever an estimate of $\mathbf{C}$ can be obtained from historical data. For instance, the function $r$ involved in Laplacian kernels can be chosen such that $\bar{\mathbf{K}}$ resembles $\mathbf{C}$ in some sense. Investigating such approaches goes beyond the scope of this paper.
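The optimality of the covariance kernel can be illustrated with the closed-form MSE of a linear estimator. The sketch below assumes zero-mean $\mathbf{f}_0$ with a randomly drawn covariance, white noise uncorrelated with the signal, and illustrative sizes; the helper names `krr_matrix` and `mse` are hypothetical. For $\hat{\mathbf{f}} = \mathbf{G}\mathbf{y}$, the MSE equals $\mathrm{Tr}[(\mathbf{I}_N - \mathbf{G}\boldsymbol{\Phi}) \mathbf{C} (\mathbf{I}_N - \mathbf{G}\boldsymbol{\Phi})^T] + \sigma_e^2 \mathrm{Tr}[\mathbf{G}\mathbf{G}^T]$.

```python
import numpy as np

def krr_matrix(K_bar, Phi, mu):
    """Linear map G such that the estimate (13) equals G @ y."""
    S = Phi.shape[0]
    return K_bar @ Phi.T @ np.linalg.inv(Phi @ K_bar @ Phi.T + mu * S * np.eye(S))

def mse(G, C, Phi, sigma_e2):
    """E||f0 - G y||^2 for y = Phi f0 + e, with cov(f0) = C and white noise."""
    E = np.eye(C.shape[0]) - G @ Phi
    return np.trace(E @ C @ E.T) + sigma_e2 * np.trace(G @ G.T)

rng = np.random.default_rng(2)
N, S, sigma_e2 = 7, 4, 0.2
A = rng.standard_normal((N, N))
C = A @ A.T / N                                   # assumed signal covariance
Phi = np.eye(N)[[0, 2, 3, 6]]                     # sampling matrix (S x N)

mse_opt = mse(krr_matrix(C, Phi, sigma_e2 / S), C, Phi, sigma_e2)
mse_identity = mse(krr_matrix(np.eye(N), Phi, sigma_e2 / S), C, Phi, sigma_e2)
mse_wrong_mu = mse(krr_matrix(C, Phi, 10 * sigma_e2 / S), C, Phi, sigma_e2)
```

Under these assumptions, `mse_opt` does not exceed either alternative, matching the inequality in Proposition 2.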

A second implication of the connection between kernel ridge regression and LMMSE estimation involves signal estimation on Markov random fields [33, Ch. 8]. In this class of graphical models, an edge connects $v_n$ with $v_{n'}$ if $f_0(v_n)$ and $f_0(v_{n'})$ are not independent given $\{f_0(v_{n''})\}_{n'' \neq n, n'}$. Thus, if $v_{n'} \notin \mathcal{N}_n$, then $f_0(v_n)$ and $f_0(v_{n'})$ are independent given $\{f_0(v_{n''})\}_{n'' \neq n, n'}$. In other words, when $f_0(v_{n'})$ is known for all neighbors $v_{n'} \in \mathcal{N}_n$, function values at non-neighboring vertices do not provide further information. This spatial Markovian property motivates the name of this class of graphical models. Real-world graphs obey this property when the topology captures direct interaction, in the sense that the interaction between the entities represented by two non-neighboring vertices $v_n$ and $v_{n'}$ is necessarily through vertices in a path connecting $v_n$ with $v_{n'}$.

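###### Proposition 3.

Let $\mathcal{G}$ be a Markov random field, and consider the estimator in (13) with $\bar{\mathbf{K}} = \mathbf{C} := \mathbb{E}[\mathbf{f}_0 \mathbf{f}_0^T]$, and $\mu = \sigma_e^2 / S$. Then, it holds that

$$\hat{f}_{\mathrm{RR}}(v_n) = \begin{cases} \hat{f}_{\mathrm{LMMSE}}(v_n | \mathcal{N}_n) & \text{if } n \notin \mathcal{S} \\ y_{s(n)} - \hat{e}_{s(n)} & \text{if } n \in \mathcal{S} \end{cases} \tag{23}$$

for $n = 1, \ldots, N$, where $s(n)$ denotes the sample index of the observed vertex $v_n$, i.e., $y_{s(n)} = f_0(v_n) + e_{s(n)}$, and

$$\hat{e}_{s(n)} := \frac{\sigma_e^2}{\sigma_{n | \mathcal{N}_n}^2} \big[ \hat{f}_{\mathrm{RR}}(v_n) - \hat{f}_{\mathrm{LMMSE}}(v_n | \mathcal{N}_n) \big].$$

Here, $\hat{f}_{\mathrm{LMMSE}}(v_n | \mathcal{N}_n)$ is the LMMSE estimator of $f_0(v_n)$ given $f_0(v_{n'}) = \hat{f}_{\mathrm{RR}}(v_{n'})$, $v_{n'} \in \mathcal{N}_n$, and $\sigma_{n | \mathcal{N}_n}^2$ is the variance of this estimator.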

###### Proof.

See Appendix D. ∎

If a (noisy) observation of $f_0$ at $v_n$ is not available, i.e. $n \notin \mathcal{S}$, then kernel ridge regression finds $\hat{f}_{\mathrm{RR}}(v_n)$ as the LMMSE estimate of $f_0(v_n)$ given function values at the neighbors of $v_n$. However, since the latter are not directly observable, their ridge regression estimates are used instead. Conversely, when $v_n$ is observed, implying that a sample $y_{s(n)}$ is available, the sought estimator subtracts from this value an estimate $\hat{e}_{s(n)}$ of the observation noise $e_{s(n)}$. Therefore, the kernel estimate on a Markov random field seeks an estimate satisfying the system of local LMMSE conditions given by (23) for $n = 1, \ldots, N$.

Remark 1. In Proposition 3, the requirement that $\mathcal{G}$ is a Markov random field can be relaxed to that of being a conditional correlation graph, defined as a graph where $(v_n, v_{n'}) \in \mathcal{E}$ if $f_0(v_n)$ and $f_0(v_{n'})$ are correlated given $\{f_0(v_{n''})\}_{n'' \neq n, n'}$. Since correlation implies dependence, any Markov random field is also a conditional correlation graph. A conditional correlation graph can be constructed from $\mathbf{C}$ by setting $\mathcal{E} = \{(v_n, v_{n'}) : (\mathbf{C}^{-1})_{n,n'} \neq 0\}$ (see e.g. [35, Th. 10.2]).

Remark 2. Suppose that kernel ridge regression is adopted to estimate a function $f_0$ on a certain graph $\mathcal{G}$, not necessarily a Markov random field, using a kernel $\bar{\mathbf{K}}$. Then it can still be interpreted as a method applying (23) on a conditional correlation graph and adopting a signal covariance matrix $\bar{\mathbf{K}}$.

#### III-B4 Further kernels

Additional signal reconstructors can be interpreted as kernel-based regression methods for certain choices of $\bar{\mathbf{K}}$. Specifically, it can be seen that [23, eq. (27)] is tantamount to kernel ridge regression with kernel

$$\bar{\mathbf{K}} = [(\mathbf{I}_N - \mathbf{W})^T (\mathbf{I}_N - \mathbf{W})]^{-1}$$

provided that the adjacency matrix $\mathbf{W}$ is properly scaled so that this inverse exists. Another example is the Tikhonov regularized estimate in [12, eq. (15)], which is recovered as kernel ridge regression upon setting

$$\bar{\mathbf{K}} = [\mathbf{H}^T \mathbf{H} + \epsilon \mathbf{I}_N]^{-1}$$

and letting $\epsilon > 0$ tend to 0, where $\mathbf{H}$ can be viewed as a high-pass filter matrix. The role of the term $\epsilon \mathbf{I}_N$ is to ensure that the matrix within brackets is invertible.

### III-C Kernel-based smoothing and graph filtering

When an observation is available per vertex for , kernel methods can still be employed for denoising purposes. Due to the regularizer in (6), the estimate will be a smoothed version of . This section shows how ridge regression smoothers can be thought of as graph filters, and vice versa. The importance of this two-way link is in establishing that kernel smoothers can be implemented in a decentralized fashion as graph filters [4].

Upon setting in (13), one recovers the ridge regression smoother . If is a Laplacian kernel, then

 (24)

where .
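As a numerical check of the spectral form of the smoother, the sketch below assumes that the ridge regression smoother takes the standard form $\bar{\mathbf{K}}(\bar{\mathbf{K}} + \mu N \mathbf{I}_N)^{-1}\mathbf{y}$ and that the Laplacian kernel is a diffusion kernel with $r(\lambda) = \exp(\sigma^2 \lambda / 2)$. Both forms, the helper names, and the toy graph are assumptions made for illustration only.

```python
import numpy as np

def diffusion_kernel(L, sigma2):
    """Assumed Laplacian kernel: U diag(1 / r(lambda)) U^T with r = exp(sigma2 * lambda / 2)."""
    lam, U = np.linalg.eigh(L)
    r = np.exp(sigma2 * lam / 2.0)
    K = U @ np.diag(1.0 / r) @ U.T
    return K, U, r

def ridge_smoother(K, y, mu):
    """Assumed smoother form: K (K + mu * N * I)^{-1} y."""
    N = len(y)
    return K @ np.linalg.solve(K + mu * N * np.eye(N), y)

# Toy path graph on 4 vertices (assumption).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

y = np.array([1.0, -2.0, 0.5, 3.0])
mu, sigma2 = 0.1, 1.0

K, U, r = diffusion_kernel(L, sigma2)
f_direct = ridge_smoother(K, y, mu)

# Equivalent spectral form: each Fourier coefficient is scaled by 1 / (1 + mu * N * r).
gain = 1.0 / (1.0 + mu * len(y) * r)
f_spectral = U @ (gain * (U.T @ y))

print(np.allclose(f_direct, f_spectral))
```

Because `r` increases with the Laplacian eigenvalue, the gain decreases with frequency, so under these assumptions the smoother acts as a low-pass operation.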

To see how (24) relates to a graph filter, recall that the latter is an operator assigning , where [4]

 yF (25a) (25b)

Graph filters can be implemented in a decentralized fashion since (25a) involves successive products of by and these products can be computed at each vertex by just exchanging information with neighboring vertices. Expression (25b) can be rewritten in the Fourier domain (cf. Sec. III-B1) as upon defining and . For this reason, the diagonal of is referred to as the frequency response of the filter.
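The equivalence between the vertex-domain and Fourier-domain descriptions can be illustrated with a short sketch. It assumes the common polynomial form of a graph filter, $\sum_l h_l \mathbf{S}^l \mathbf{y}$, with the Laplacian used as the matrix $\mathbf{S}$. The function names, the coefficients `h`, and the toy graph are hypothetical.

```python
import numpy as np

def graph_filter_vertex(S, h, y):
    """Compute sum_l h[l] * S^l y using successive products by S.

    Each product S @ z only mixes values of neighboring vertices, which is
    what allows a decentralized implementation.
    """
    out = np.zeros_like(y, dtype=float)
    z = y.astype(float)
    for h_l in h:
        out += h_l * z
        z = S @ z  # one more round of neighbor exchanges
    return out

def graph_filter_spectral(S, h, y):
    """Same filter in the Fourier domain: U diag(freq_resp) U^T y."""
    lam, U = np.linalg.eigh(S)
    # Frequency response: the polynomial sum_l h[l] * lambda^l at each eigenvalue.
    freq_resp = np.polyval(h[::-1], lam)
    return U @ (freq_resp * (U.T @ y))

# Toy path graph on 4 vertices (assumption).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

h = [1.0, -0.5, 0.1]  # hypothetical filter coefficients
y = np.array([1.0, 2.0, 3.0, 4.0])

y_vertex = graph_filter_vertex(L, h, y)
y_spectral = graph_filter_spectral(L, h, y)
print(np.allclose(y_vertex, y_spectral))
```

The vertex-domain routine never forms powers of the matrix explicitly; it reuses the previous product, so a filter of order $l$ costs $l$ rounds of neighbor exchanges.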

Comparing (24) with (25b) shows that can be interpreted as a graph filter with frequency response . Thus, implementing in a decentralized fashion using (25a) boils down to solving for the system of linear equations . Conversely, given a filter, a Laplacian kernel can be found so that filter and smoother coincide. To this end, assume without loss of generality that , where ; otherwise, simply scale . Then, given , the sought kernel can be constructed by setting

## IV Multi-kernel graph signal reconstruction

One of the limitations of kernel methods is their sensitivity to the choice of the kernel. To appreciate this, Fig. 2 depicts the normalized mean-square error (NMSE) when is the square loss and across the parameter of the adopted diffusion kernel (see Sec. III-B1). The simulation setting is described in Sec. V. At this point though, it suffices to stress the impact of on the NMSE and the dependence of the optimum on the bandwidth of .

Similarly, the performance of estimators for bandlimited signals degrades considerably if the estimator assumes a frequency support that differs from the actual one. Even for estimating low-pass signals, for which , parameter is unknown in practice. Approaches for setting were considered in [11, 16], but they rely solely on and , disregarding the observations . Note that by adopting the bandlimited kernels from Sec. III-B2, bandwidth selection boils down to kernel selection, so both problems will be treated jointly in the sequel through the lens of kernel-based learning.

This section advocates an MKL approach to kernel selection in graph signal reconstruction. Two algorithms with complementary strengths will be developed. Both select the most suitable kernels within a user-specified kernel dictionary.

### IV-A RKHS superposition

Since in (6) is determined by , kernel selection is tantamount to RKHS selection. Therefore, a kernel dictionary can be equivalently thought of as an RKHS dictionary , which motivates estimates of the form

 (26)

Upon adopting a criterion that controls sparsity in this expansion, the “best” RKHSs will be selected. A reasonable approach is therefore to generalize (6) to accommodate multiple RKHSs. With selected as the square loss and , one can pursue an estimate by solving

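$$\min_{\{f_m \in \mathcal{H}_m\}_{m=1}^{M}} \; \frac{1}{S} \sum_{s=1}^{S} \left[ y_s - \sum_{m=1}^{M} f_m(v_{n_s}) \right]^2 + \mu \sum_{m=1}^{M} \| f_m \|_{\mathcal{H}_m}. \tag{27}$$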

Invoking the representer theorem (prop:representer) for each $f_m$ establishes that the minimizers of (27) can be written as

$$\hat{f}_m(v) = \sum_{s=1}^{S} \alpha^{m}_{s} \, \kappa_m(v, v_{n_s}), \qquad m = 1, \ldots, M \tag{28}$$

for some coefficients . Substituting (28) into (27) suggests obtaining these coefficients as

 (29)

where , and with . Letting , expression (29) becomes

 (30)

Note that the sum in the regularizer of (30) can be interpreted as the -norm of , which is known to promote sparsity in its entries and therefore in (26). Indeed, (30) can be seen as a particular instance of group Lasso [34].
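To make the group-sparse structure concrete, the sketch below evaluates criterion (27) for functions of the form (28). It relies on two standard RKHS facts: the values of $\hat{f}_m$ at the sampled vertices are $\mathbf{K}_m \boldsymbol{\alpha}_m$, and $\|\hat{f}_m\|_{\mathcal{H}_m} = (\boldsymbol{\alpha}_m^T \mathbf{K}_m \boldsymbol{\alpha}_m)^{1/2}$, where $\mathbf{K}_m$ denotes the $S \times S$ kernel matrix on the sampled vertices. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def mkl_objective(K_list, alpha_list, y, mu):
    """Evaluate (27) at functions of the form (28).

    K_list[m]     : S x S kernel matrix of the m-th kernel on the sampled vertices
    alpha_list[m] : length-S coefficient vector of the m-th function
    """
    S = len(y)
    # Sum of the M component functions evaluated at the sampled vertices.
    fit = sum(K @ a for K, a in zip(K_list, alpha_list))
    loss = np.sum((y - fit) ** 2) / S
    # One unsquared RKHS norm per kernel: this is what makes the penalty group-sparse.
    penalty = sum(np.sqrt(max(a @ K @ a, 0.0)) for K, a in zip(K_list, alpha_list))
    return loss + mu * penalty

# Toy example with S = 2 samples and M = 2 kernels (all values are assumptions).
K_list = [np.eye(2), 2.0 * np.eye(2)]
alpha_list = [np.array([1.0, 0.0]), np.array([0.0, 0.0])]
y = np.array([1.0, 0.0])

value = mkl_objective(K_list, alpha_list, y, mu=0.5)
print(value)  # fit is exact, so only the penalty of the first kernel remains: 0.5
```

A kernel whose coefficient vector is entirely zero contributes nothing to the penalty, which is why minimizing this criterion tends to discard whole kernels rather than individual coefficients.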

As shown next, (30) can be efficiently solved using the alternating-direction method of multipliers (ADMM) [36]. To this end, rewrite (30) by defining and , and introducing the auxiliary variable , as

 (31)

ADMM iteratively minimizes the augmented Lagrangian of (31) with respect to and in a block-coordinate descent fashion, and updates the Lagrange multipliers associated with the equality constraint using gradient ascent (see [37] and references therein). The resulting iteration is summarized as Algorithm 1, where is the augmented Lagrangian parameter, is the Lagrange multiplier associated with the equality constraint, and

$$T_{\zeta}(\mathbf{a}) := \frac{\max(0, \|\mathbf{a}\|_2 - \zeta)}{\|\mathbf{a}\|_2} \, \mathbf{a}$$

is the so-called soft-thresholding operator [36].
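The operator can be sketched in a few lines. The function name is hypothetical, and the handling of the all-zero input (returning zeros to avoid a division by zero) is a convention assumed here rather than something stated in the text.

```python
import numpy as np

def soft_threshold(a, zeta):
    """Vector soft-thresholding: shrink the Euclidean norm of a by zeta, down to zero."""
    norm = np.linalg.norm(a)
    if norm == 0.0:
        # Assumed convention: the zero vector is mapped to itself.
        return np.zeros_like(a, dtype=float)
    return max(0.0, norm - zeta) / norm * a

a = np.array([3.0, 4.0])           # Euclidean norm equals 5
shrunk = soft_threshold(a, 1.0)    # norm reduced from 5 to 4, direction preserved
zeroed = soft_threshold(a, 6.0)    # threshold exceeds the norm, so the output is zero
print(shrunk, zeroed)
```

Since the whole vector is set to zero whenever its norm falls below the threshold, applying this operator blockwise is what produces group sparsity.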

After obtaining from Algorithm 1, the wanted function estimate can be recovered as

 (32)

It is recommended to normalize the kernel matrices in order to prevent imbalances in the kernel selection. Specifically, one can scale such that . If is a Laplacian kernel (see Sec. III-B1), where , one can scale to ensure .

Remark 3. Although criterion (27) is reminiscent of the MKL approach of [34], the latter differs markedly because it assumes that the right-hand side of (26) is uniquely determined given , which allows application of (6) over a direct-sum RKHS with an appropriately defined norm. However, this approach cannot be pursued here since RKHSs of graph signals frequently overlap, implying that their sum is not a direct one (cf. discussion after (5)).

### IV-B Kernel superposition

The MKL algorithm in Sec. IV-A can identify the best subset of RKHSs and therefore kernels, but entails unknowns (cf. (29)). This section introduces an alternative approach entailing only variables at the price of not guaranteeing a sparse kernel expansion.

The approach is to postulate a kernel of the form , where is given and . The coefficients can be found by jointly minimizing (11) with respect to and  [38]

 (33)

where . Except for degenerate cases, problem (33) is not jointly convex in and