## I Introduction

Graph data play a central role in analysis and inference tasks for social, brain, communication, biological, transportation, and sensor networks [1], thanks to their ability to capture relational information. Attributes or features associated with vertices can be interpreted as functions or signals defined on graphs. In social networks for instance, where a vertex represents a person and an edge corresponds to a friendship relation, such a function may denote e.g. the person's age, location, or rating of a given movie.

Research efforts over recent years have centered on estimating or processing functions on graphs; see e.g. [2, 3, 1, 4, 5, 6]. Existing approaches rely on the premise that signals obey a certain form of parsimony relative to the graph topology. For instance, it seems reasonable to estimate a person's age by looking at their friends' ages. The present paper deals with a general version of this task, where the goal is to estimate a graph signal given noisy observations on a subset of vertices.

The machine learning community has already looked at issues related to signal processing on graphs (SPoG) in the context of *semi-supervised learning*, under the term *transductive* regression and classification [6, 7, 8]. Existing approaches rely on smoothness assumptions for inference of processes over graphs using *nonparametric* methods [6, 2, 3, 9]. Whereas some works consider estimation of real-valued signals [7, 9, 8, 10], most of this body of literature has focused on estimating binary-valued functions; see e.g. [6]. On the other hand, function estimation has also been investigated recently by the SPoG community under the term *signal reconstruction* [11, 12, 13, 14, 15, 16, 17, 18]. Existing approaches commonly adopt *parametric* estimation tools and rely on *bandlimitedness*, by which the signal of interest is assumed to lie in the span of the leading eigenvectors of the graph Laplacian or the adjacency matrix [19, 14, 16, 12, 13, 17, 18]. Different from machine learning works, SPoG research is mainly concerned with estimating real-valued functions.

The present paper cross-pollinates ideas and broadens both machine learning and SPoG perspectives under the *unifying* framework of kernel-based learning. The first part unveils the implications of adopting this standpoint and demonstrates how it naturally accommodates a number of SPoG concepts and tools. From a high level, this connection (i) brings to bear performance bounds and algorithms from transductive regression [8] and the extensively analyzed general kernel methods (see e.g. [20]); (ii) offers the possibility of reducing the dimension of the optimization problems involved in Tikhonov regularized estimators by invoking the so-called *representer theorem* [21]; and (iii) provides guidelines for systematically selecting parameters in existing signal reconstruction approaches by leveraging the connection with linear minimum mean-square error (LMMSE) estimation via *covariance kernels*.

Further implications of applying kernel methods to graph signal reconstruction are also explored. Specifically, it is shown that the finite dimension of graph signal spaces allows for an insightful proof of the representer theorem which, different from existing proofs relying on functional analysis, solely involves linear algebra arguments. Moreover, an intuitive probabilistic interpretation of graph kernel methods is introduced based on graphical models. These findings are complemented with a technique to deploy regression with Laplacian kernels in big-data setups.

It is further established that a number of existing signal reconstruction approaches, including the least-squares (LS) estimators for bandlimited signals from [12, 13, 14, 15, 16, 11]; the Tikhonov regularized estimators from [12, 22, 4] and [23, eq. (27)]; and the maximum a posteriori estimator in [13], can be viewed as kernel methods on *reproducing kernel Hilbert spaces* (RKHSs) of graph signals. Popular notions in SPoG such as graph filters, the graph Fourier transform, and bandlimited signals can also be accommodated under the kernel framework. First, it is seen that a *graph filter* [4] is essentially a kernel smoother [24]. Second, *bandlimited kernels* are introduced to accommodate estimation of bandlimited signals. Third, the connection between the so-called *graph Fourier transform* [4] (see [15, 5] for a related definition) and Laplacian kernels [2, 3] is delineated. Relative to methods relying on the bandlimited property (see e.g. [12, 13, 14, 15, 16, 11, 17]), kernel methods offer increased flexibility in leveraging prior information about the graph Fourier transform of the estimated signal.

The second part of the paper pertains to the challenge of model selection. On the one hand, a number of reconstruction schemes in SPoG [12, 13, 14, 15, 17] require knowledge of the signal bandwidth, which is typically unknown [11, 16]. Existing approaches for determining this bandwidth rely solely on the set of sampled vertices, disregarding the observations [11, 16]. On the other hand, existing kernel-based approaches [1, Ch. 8] necessitate proper kernel selection, which is computationally inefficient when performed via cross-validation.

The present paper addresses both issues by means of two multi-kernel learning (MKL) techniques having complementary strengths. Note that existing MKL methods on graphs are confined to estimating binary-valued signals [25, 26, 27]. This paper, on the other hand, is concerned with MKL algorithms for real-valued graph signal reconstruction. The novel graph MKL algorithms optimally combine the kernels in a given dictionary and simultaneously estimate the graph signal by solving a single optimization problem.

The rest of the paper is structured as follows. Sec. II formulates the problem of graph signal reconstruction. Sec. III presents kernel-based learning as an encompassing framework for graph signal reconstruction, and explores the implications of adopting such a standpoint. Two MKL algorithms are then presented in Sec. IV. Sec. V complements analytical findings with numerical tests by comparing with competing alternatives via synthetic- and real-data experiments. Finally, concluding remarks are highlighted in Sec. VI.

Notation. denotes the remainder of integer division by ; the Kronecker delta; and the indicator of condition , returning 1 if is satisfied and 0 otherwise. Scalars are denoted by lowercase letters, vectors by bold lowercase, and matrices by bold uppercase. The -th entry of matrix is . Notation and respectively represent the Euclidean norm and trace; denotes the identity matrix; is the -th canonical vector of , while () is a vector of appropriate dimension with all zeros (ones). The span of the columns of is denoted by , whereas (resp. ) means that is positive definite (resp. semi-definite). Superscripts and respectively stand for transposition and pseudo-inverse, whereas denotes expectation.

## II Problem Statement

A graph is a tuple , where
is the vertex set, and is a map
assigning a weight to each vertex pair. For simplicity, it is assumed
that . This paper focuses on
*undirected* graphs, for which . A graph is said to be *unweighted* if is
either 0 or 1. The edge set is the support
of , i.e., . Two vertices and are *adjacent*,
*connected*, or *neighbors* if . The
-th neighborhood is the set of neighbors
of , i.e., . The
information in is compactly represented by the
weighted adjacency matrix , whose
-th entry is
; the
diagonal *degree* matrix , whose
-th entry is ; and the *Laplacian* matrix
, which is symmetric and positive
semidefinite [1, Ch. 2]. The latter is sometimes
replaced with its normalized version , whose eigenvalues are confined to the interval .

A real-valued function (or signal) on a graph is a map . As mentioned in Sec. I, the value represents an attribute or feature of , such as age, political alignment, or annual income of a person in a social network. Signal is thus represented by .

Suppose that a collection of noisy samples (or observations) , is available, where models noise and contains the indices of the sampled vertices. In a social network, this may be the case if a subset of persons have been surveyed about the attribute of interest (e.g. political alignment). Given , and assuming knowledge of , the goal is to estimate . This provides estimates of at both observed and unobserved vertices . By defining , the observation model is summarized as

(1)

where and is an matrix with entries , , set to one, and the rest set to zero.
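As an illustration, the following Python sketch builds a toy instance of the sampling model in (1). The variable names (`f0`, `Phi`, `y`), the graph size, and the noise level are illustrative choices rather than notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 100, 20                       # number of vertices and of samples
f0 = rng.standard_normal(N)          # placeholder "true" graph signal
sampled = np.sort(rng.choice(N, size=S, replace=False))  # sampled vertex indices

Phi = np.zeros((S, N))               # S x N selection matrix: one unit entry per row
Phi[np.arange(S), sampled] = 1.0

e = 0.1 * rng.standard_normal(S)     # observation noise
y = Phi @ f0 + e                     # noisy samples at the observed vertices
```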

## III Unifying the reconstruction of graph signals

Kernel methods constitute the “workhorse” of statistical learning for nonlinear function estimation [20]. Their popularity can be ascribed to their simplicity, flexibility, and good performance. This section presents kernel regression as a novel unifying framework for graph signal reconstruction.

Kernel regression seeks an estimate of in an RKHS , which is the space of functions defined as

(2)

The *kernel map* is any function defining a symmetric
and positive semidefinite matrix with entries
[28]. Intuitively, is a basis function in
(2) measuring similarity between the values of
at and . For instance, if a *feature
vector* containing
attributes of the entity represented by is known for
, one can employ the popular
*Gaussian kernel* , where
is a user-selected parameter [20]. When
such feature vectors are not available, the graph
topology can be leveraged to construct graph kernels as detailed in
Sec. III-B.

Different from RKHSs of functions defined over infinite sets, the expansion in (2) is finite since is finite. This implies that RKHSs of graph signals are finite-dimensional spaces. From (2), it follows that any signal in can be expressed as:

(3)

for some vector . Given two functions and , their RKHS inner product is defined as

(4)

where . (Whereas denotes a *function*, the symbol represents the *scalar* resulting from evaluating at vertex .) The RKHS norm is defined by

(5)

and will be used as a regularizer to control overfitting.
As a special case, setting
recovers the standard inner product , and Euclidean
norm . Note that when
, the set of functions of the form
(3) equals . Thus, two RKHSs
with strictly positive definite kernel matrices contain the same
functions. They differ only in their RKHS inner products and
norms. Interestingly, this observation establishes that any positive
definite kernel is *universal* [29] for graph
signal reconstruction.

The term *reproducing kernel* stems
from the reproducing property. Let
denote the map , where . Using (4), the reproducing
property can be expressed as . Due to the linearity
of inner products and the fact that all signals in are the
superposition of functions of the form
, the reproducing property asserts
that inner products can be obtained just by evaluating
. The reproducing property is of paramount importance when
dealing with an RKHS of functions defined on *infinite*
spaces (thus excluding RKHSs of graph signals), since it offers an
efficient alternative to the costly multidimensional integration
required by inner products such as .

Given , RKHS-based function estimators are obtained by solving functional minimization problems formulated as

(6)

where the regularization parameter controls overfitting, the increasing function is used to promote smoothness, and the loss function measures how estimates deviate from the data. The so-called *square loss* constitutes a popular choice for , whereas is often set to or .

To simplify notation, consider loss functions expressible as ; extensions to more general cases are straightforward. The vector-version of such a function is . Substituting (3) and (5) into (6) shows that can be obtained as , where

(7)

An alternative form of (7) that will be frequently used in the sequel results upon noting that . Thus, one can rewrite (7) as

(8)

If , the constraint can be omitted, and can be replaced with . If contains null eigenvalues, it is customary to remove the constraint by replacing (or ) with a perturbed version (respectively ), where is a small constant. Expression (8) shows that kernel regression unifies and subsumes the Tikhonov-regularized graph signal reconstruction schemes in [12, 22, 4] and [23, eq. (27)] by properly selecting , , and (see Sec. III-B).

### III-A Representer theorem

Although graph signals can be reconstructed from (7), such an approach involves optimizing over variables. This section shows that a solution can be obtained by solving an optimization problem in variables, where typically .

The representer theorem [21, 28] plays an instrumental role in the non-graph setting of infinite-dimensional , where (6) cannot be directly solved. This theorem renders the problem tractable by providing a finite parameterization of the sought function in (6). On the other hand, when comprises graph signals, (6) is inherently finite-dimensional and can be solved directly. However, the representer theorem can still be beneficial to reduce the dimension of the optimization in (7).

###### Theorem 1 (Representer theorem).

The solution to the functional minimization in (6) can be expressed as

(9)

for some , .

The conventional proof of the representer theorem involves tools from functional analysis [28]. However, when comprises functions defined on finite spaces, such as graph signals, an insightful proof can be obtained relying solely on linear algebra arguments (see Appendix A).

Since the solution of (6) lies in , it can be expressed as for some . Theorem 1 states that the terms corresponding to unobserved vertices , , play no role in the kernel expansion of the estimate; that is, . Thus, whereas (7) requires optimization over variables, Theorem 1 establishes that a solution can be found by solving a problem in variables, where typically . Clearly, this conclusion carries over to the signal reconstruction schemes in [12, 22, 4] and [23, eq. (27)], since they constitute special instances of kernel regression. The fact that the number of parameters to be estimated after applying Theorem 1 depends on (in fact, equals) the number of samples justifies why in (6) is referred to as a *nonparametric estimate*.

Theorem 1 shows the form of but does not provide the optimal , which is found after substituting (9) into (6) and solving the resulting optimization problem with respect to these coefficients. To this end, let , and write to deduce that

(10)

From (7) and (10), the optimal can be found as

(11)

where .

Example 1 *(kernel ridge regression)*. When is the square loss, criterion (11) admits a closed-form minimizer, yielding the *kernel ridge regression* estimate. It is given by , where

(12a)

(12b)

Therefore, can be expressed as

(13)

As seen in the next section, (13) generalizes a number of existing signal reconstructors upon properly selecting . Thus, Theorem 1 can also be used to simplify Tikhonov-regularized estimators such as the one in [12, eq. (15)]. To see this, just note that (13) inverts an matrix whereas [12, eq. (16)] entails the inversion of an matrix.
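A minimal Python sketch of the kernel ridge regression estimator in the spirit of (13) follows; it exploits Theorem 1 so that only a system of the size of the sample set is solved. The exact scaling of the regularization term relative to the loss in (6) is a convention, so the plain `mu * np.eye(S)` below is an assumption.

```python
import numpy as np

def kernel_ridge_graph(K, sampled, y, mu):
    """Kernel ridge regression over a graph, in the spirit of (13).

    K       : N x N kernel matrix over all vertices
    sampled : indices of the S observed vertices
    y       : observations at those vertices
    mu      : regularization weight (its scaling vs. the loss is a convention)
    Returns an estimate of the signal at all N vertices.
    """
    S = len(sampled)
    K_ss = K[np.ix_(sampled, sampled)]                 # S x S sub-kernel
    alpha = np.linalg.solve(K_ss + mu * np.eye(S), y)  # expansion coefficients
    return K[:, sampled] @ alpha                       # representer-theorem expansion
```

Only the S x S block of the kernel matrix enters the linear solve, in agreement with the dimensionality reduction afforded by Theorem 1.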

Example 2 *(support vector regression)*. If equals the so-called -insensitive loss and , then (6) constitutes a support vector machine for regression (see e.g. [20, Ch. 1]).

### III-B Graph kernels for signal reconstruction

When estimating functions on graphs, conventional
kernels such as the aforementioned Gaussian kernel cannot be applied
because the underlying set where graph signals are defined is not a
metric space. Indeed, no vertex addition , scaling , or norm
can be naturally defined on .
An alternative is to embed into a Euclidean space via a feature map , and apply a conventional kernel afterwards.
However, for a given graph it is generally unclear how to design such
a map or select , which motivates the adoption of
graph kernels [3]. The rest of this
section elaborates on three classes of graph kernels, namely
*Laplacian*, *bandlimited*, and novel *covariance*
kernels for reconstructing graph signals.

#### III-B1 Laplacian kernels

The term Laplacian kernel comprises a wide family of kernels obtained by applying a certain function to the Laplacian matrix . From a theoretical perspective, Laplacian kernels are well motivated since they constitute the graph counterpart of the so-called *translation invariant kernels* in Euclidean spaces [3]. This section reviews Laplacian kernels, provides novel insights in terms of interpolating signals, and highlights their versatility in capturing prior information about the *graph Fourier transform* of the estimated signal.

Let denote the eigenvalues of the graph Laplacian matrix , and consider the eigendecomposition , where . A Laplacian kernel is a kernel map generating a matrix of the form

(14)

where is the result of applying the user-selected non-negative map to the diagonal entries of . For reasons that will become clear, the map is typically increasing in . Common choices include the diffusion kernel [2], and the -step random walk kernel , [3]. Laplacian regularization [3, 30, 9, 31, 4] is effected by setting with sufficiently large.
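The sketch below constructs a Laplacian kernel matrix as in (14) for a user-selected spectral map. Since the parameterizations in the text are not reproduced here, the diffusion and p-step random walk choices follow the standard definitions from the cited literature, and their hyperparameters are illustrative.

```python
import numpy as np

def laplacian_kernel(L, r_inv):
    """Return K = U diag(r^{-1}(lambda)) U^T for a symmetric PSD Laplacian L.

    r_inv maps each Laplacian eigenvalue to 1/r(lambda), where r is the
    user-selected non-negative (typically increasing) spectral map.
    """
    lam, U = np.linalg.eigh(L)
    return (U * r_inv(lam)) @ U.T

# Common choices (hyperparameters sigma2, a, p are illustrative):
sigma2, a, p = 1.0, 2.5, 2
diffusion_rinv   = lambda lam: np.exp(-sigma2 * lam / 2)   # diffusion kernel
random_walk_rinv = lambda lam: (a - lam) ** p              # p-step random walk kernel
```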

Observe that obtaining generally
requires an eigendecomposition of , which is computationally
challenging for large graphs (). Two techniques to
reduce complexity in these *big data* scenarios are proposed in
Appendix B.

At this point, it is prudent to offer interpretations and insights into the principles behind the operation of Laplacian kernels. Towards this objective, note first that the regularizer from (8) is an increasing function of

(15)

where comprises the
projections of onto the eigenvectors of , and is
referred to as the *graph Fourier transform* of in the
SPoG parlance [4]. Before interpreting (15),
it is worth elucidating the rationale behind this term. Since
is orthogonal, one can decompose as

(16)

Because vectors , or
more precisely their signal counterparts , are
*eigensignals* of the so-called *graph shift operator*
, (16)
resembles the classical Fourier transform in the sense that it
expresses a signal as a superposition of *eigensignals* of a
Laplacian operator [4]. Recalling from Sec. II that
denotes the weight of the edge
between and , one can consider the
smoothness measure for graph functions given by

where the last equality follows from the definition of . Clearly, it holds . Since
, it follows that
. In analogy to signal processing for time
signals, where lower frequencies correspond to smoother eigensignals,
the index , or alternatively the eigenvalue
, is interpreted as the *frequency* of
.

It follows from (15) that the regularizer in (8) strongly penalizes those for which the corresponding is large, thus promoting a specific structure in this frequency domain. Specifically, one prefers to be large whenever is small and vice versa. The fact that is expected to decrease with for smooth , motivates the adoption of an increasing [3]. Observe that Laplacian kernels can capture richer forms of prior information than the signal reconstructors of bandlimited signals in [12, 13, 14, 15, 17, 18], since the latter can solely capture the support of the Fourier transform whereas the former can also leverage magnitude information.
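In code, the quadratic regularizer of (15) can be evaluated directly in the graph Fourier domain. This short sketch assumes the Laplacian-kernel convention of the previous snippet, with `r` denoting the spectral map.

```python
import numpy as np

def laplacian_regularizer(L, f, r):
    """Evaluate the RKHS regularizer of (15): sum_i r(lambda_i) * ftilde_i**2,
    where ftilde = U^T f is the graph Fourier transform of the signal f."""
    lam, U = np.linalg.eigh(L)
    f_tilde = U.T @ f                        # graph Fourier transform
    return float(np.sum(r(lam) * f_tilde ** 2))
```

With an increasing map r, high-frequency coefficients are penalized more heavily, which promotes smooth estimates.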

Example 3 *(circular graphs)*.
This example capitalizes on Theorem 1 to
present a novel SPoG-inspired intuitive interpretation of
nonparametric regression with Laplacian kernels.
To do so, a closed-form expression
for the Laplacian kernel matrix of a circular graph (or
ring) will be derived. This class of graphs has been commonly employed in
the literature to illustrate connections between SPoG and signal
processing of time-domain signals [5].

Up to vertex relabeling, an unweighted circular graph
satisfies . Therefore, its Laplacian matrix can be
written as , where is the rotation matrix resulting
from circularly shifting the columns of one
position to the right, i.e.,
. Matrix is *circulant* since its
-th row can be obtained by circularly shifting the
-st row one position to the right. Hence, can
be diagonalized by the standard Fourier matrix [32], meaning

(17)

where is the unitary inverse discrete Fourier transform matrix and . Matrices and replace and since, for notational brevity, the eigendecomposition (17) involves complex-valued eigenvectors and the eigenvalues have not been sorted in ascending order.

From (14), a Laplacian kernel matrix is given by , where has entries . It can be easily seen that , where

If , one has that

(18)

Recall that Theorem 1 dictates , where . Since and because is periodic in with period , it follows that the vectors are all circularly shifted versions of each other. Moreover, since is positive semidefinite, the largest entry of is precisely the -th one. This motivates interpreting as an interpolating signal centered at , and suggests that the expression can be thought of as a reconstruction equation. From this vantage point, signals play an analogous role to sinc functions in signal processing of time-domain signals. Examples of these interpolating signals are depicted in Fig. 1.
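The circulant structure in Example 3 can be verified numerically. The snippet below forms the Laplacian of an unweighted ring, builds a diffusion kernel (an illustrative choice of spectral map), and checks that the kernel columns, i.e. the interpolating signals, are circular shifts of one another.

```python
import numpy as np

N, sigma2 = 32, 1.0
R = np.roll(np.eye(N), 1, axis=1)            # circularly shift identity columns by one
L = 2 * np.eye(N) - R - R.T                  # Laplacian of the unweighted ring
lam, U = np.linalg.eigh(L)
K = (U * np.exp(-sigma2 * lam / 2)) @ U.T    # diffusion Laplacian kernel

# Each column of K is a circularly shifted copy of the previous one:
print(np.allclose(K[:, 1], np.roll(K[:, 0], 1)))    # expected: True
```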

#### III-B2 Bandlimited kernels

A number of signal reconstruction approaches in the
SPoG literature deal with graph bandlimited signals; see
e.g. [12, 13, 14, 15, 17, 16, 18, 11].
Here, the notion of *bandlimited kernel* is
introduced to formally show that the LS estimator for bandlimited
signals [12, 13, 14, 15, 16, 11]
is a limiting case of the kernel ridge regression estimate
from (13). This notion will come in handy
in Secs. IV and V to estimate the
bandwidth of a bandlimited signal from the observations
.

Signal is said to be
*bandlimited* if it admits an expansion (16) with
supported on a set ; that is,

(19)

where contains the columns of
with indices in , and is a
vector stacking . The *bandwidth* of can be defined as the
cardinality , or, as the greatest element of
.

If is bandlimited, it follows from (1) that for some . The LS estimate of is therefore given by [12, 13, 14, 15, 16, 11]

(20a)

(20b)

where the second equality assumes that is invertible, a necessary and sufficient condition for the entries of to be identifiable.
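A compact sketch of the LS reconstruction in (20) is given next. It assumes a low-pass support (the first B Laplacian eigenvectors); more general index sets work the same way, and the names `U_B` and `B` are illustrative.

```python
import numpy as np

def ls_bandlimited_estimate(L, Phi, y, B):
    """LS reconstruction of a bandlimited graph signal, cf. (20).

    L   : N x N graph Laplacian
    Phi : S x N sampling matrix
    y   : S noisy observations
    B   : assumed bandwidth (number of active Fourier coefficients)
    """
    _, U = np.linalg.eigh(L)
    U_B = U[:, :B]                                         # eigenvectors spanning the signal
    coeff, *_ = np.linalg.lstsq(Phi @ U_B, y, rcond=None)  # LS Fourier coefficients
    return U_B @ coeff                                     # reconstructed signal
```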

The estimate in (20) can be accommodated in the
kernel regression framework by properly constructing a
*bandlimited kernel*. Intuitively, one
can adopt a Laplacian kernel for which is
large if (cf.
Sec. III-B1). Consider the Laplacian kernel
with

(21)

For large , this function strongly penalizes (cf. (15)), which promotes bandlimited estimates. The reason for setting for instead of is to ensure that is non-singular, a property that simplifies the statement and the proofs of some of the results in this paper.
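The bandlimited kernel of (21) can be emulated as below: in-band frequencies receive a small penalty and out-of-band frequencies a large penalty beta, so that for large beta the kernel ridge regression estimate approaches the LS estimator. The specific in-band value of 1 is an assumption, made only to keep the kernel non-singular, as discussed above.

```python
import numpy as np

def bandlimited_kernel(L, B, beta=1e6):
    """Laplacian kernel promoting (approximately) bandlimited estimates, cf. (21):
    the first B frequencies are barely penalized, all others heavily (by beta)."""
    lam, U = np.linalg.eigh(L)                      # eigenvalues in ascending order
    r = np.where(np.arange(len(lam)) < B, 1.0, beta)
    return (U / r) @ U.T                            # K = U diag(1/r) U^T
```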

###### Proposition 1.

###### Proof:

See Appendix C. ∎

Proposition 1 shows that the framework of kernel-based regression subsumes LS estimation of bandlimited signals. A non-asymptotic counterpart of Proposition 1 can be found by setting for in (21), and noting that if . Note however that imposing renders a degenerate kernel-based estimate.

#### III-B3 Covariance kernels

So far, signal has been assumed deterministic, which precludes accommodating certain forms of prior information that probabilistic models can capture, such as domain knowledge and historical data. A probabilistic interpretation of kernel methods on graphs will be pursued here to show that: (i) the optimal in the MSE sense for ridge regression is the covariance matrix of ; and, (ii) kernel-based ridge regression seeks an estimate satisfying a system of local LMMSE estimation conditions on a Markov random field [33, Ch. 8].

Suppose without loss of generality that are zero-mean random variables. The LMMSE estimator of given is the linear estimator minimizing , where the expectation is over all and noise realizations. With , the LMMSE estimate is given by

(22)

where denotes the noise variance. Comparing (22) with (13) and recalling that , it follows that with and . In other words, the similarity measure embodied in the kernel map is just the covariance . A related observation was pointed out in [34] for general kernel methods. In short, one can interpret kernel ridge regression as the LMMSE estimator of a signal with covariance matrix equal to . This statement generalizes [13, Lemma 1], which requires to be Gaussian, rank-deficient, and .
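This equivalence can be checked numerically: with the kernel set to the signal covariance and the regularization weight matched to the noise variance (a scaling assumption), the kernel ridge regression estimate coincides with the LMMSE estimate. The covariance and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, S, sigma2 = 50, 15, 0.05
A = rng.standard_normal((N, N))
C = A @ A.T + np.eye(N)                                # illustrative signal covariance
sampled = np.sort(rng.choice(N, S, replace=False))

f0 = np.linalg.cholesky(C) @ rng.standard_normal(N)    # zero-mean signal with covariance C
y = f0[sampled] + np.sqrt(sigma2) * rng.standard_normal(S)

# LMMSE estimate, cf. (22): C_{V,S} (C_{S,S} + sigma2 I)^{-1} y
C_ss = C[np.ix_(sampled, sampled)]
lmmse = C[:, sampled] @ np.linalg.solve(C_ss + sigma2 * np.eye(S), y)

# Kernel ridge regression (13) with kernel K = C and regularizer matched to sigma2
K = C
alpha = np.linalg.solve(K[np.ix_(sampled, sampled)] + sigma2 * np.eye(S), y)
krr = K[:, sampled] @ alpha

print(np.allclose(lmmse, krr))                         # True: the two estimates coincide
```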

Recognizing that kernel ridge regression is a linear estimator readily establishes the following result.

###### Proposition 2.

If , where denotes the estimator in (13), with kernel matrix , and regularization parameter , it then holds that

for all kernel matrices and .

Thus, for criteria aiming to minimize the MSE,
Proposition 2 suggests that be chosen
*close* to . This observation may be employed for kernel
selection and for parameter tuning in graph signal reconstruction
methods of the kernel ridge regression family (e.g. the Tikhonov
regularized estimators
from [12, 22, 4]
and [23, eq. (27)]), whenever an estimate of
can be obtained from historical data. For instance, the
function involved in Laplacian kernels can be chosen such that
resembles in some sense. Investigating such
approaches goes beyond the scope of this paper.

A second implication of the connection
between kernel ridge regression and LMMSE estimation involves signal
estimation on Markov random fields [33, Ch. 8].
In this class of graphical models, an edge connects
with if and
are *not* independent given
. Thus, if , then and
are independent given
. In other words, when
is known for all neighbors , function values at non-neighboring vertices
do not provide further information. This spatial Markovian property
motivates the name of this class of graphical models. Real-world
graphs obey this property when the topology captures direct
interaction, in the sense that the interaction between the entities
represented by two non-neighboring vertices and
is necessarily through vertices in a *path*
connecting with .

###### Proposition 3.

Let be a Markov random field, and consider the estimator in (13) with , and . Then, it holds that

(23)

for , where denotes the sample index of the observed vertex , i.e., , and

Here, is the LMMSE estimator of given , , and is the variance of this estimator.

###### Proof.

See Appendix D. ∎

If a (noisy) observation of at is not
available, i.e. , then kernel ridge
regression finds as the LMMSE
estimate of given function values at the
neighbors of . However, since the latter are not
directly observable, their ridge regression estimates are used
instead. Conversely, when is observed, implying that
a sample is available, the sought
estimator subtracts from this value an estimate of the observation noise
. Therefore, the kernel
estimate on a Markov random field seeks an estimate satisfying the
system of *local LMMSE conditions* given by
(23) for .

Remark 1. In Proposition 3, the requirement that is
a Markov random field can be relaxed to that of being a
*conditional correlation graph*, defined as a graph where
if
and are
correlated given . Since correlation implies dependence, any
Markov random field is also a conditional correlation graph. A
conditional correlation graph can be constructed from by setting (see e.g. [35, Th. 10.2]).

Remark 2. Suppose that kernel ridge regression is adopted to estimate a function on a certain graph , not necessarily a Markov random field, using a kernel . Then it can still be interpreted as a method applying (23) on a conditional correlation graph and adopting a signal covariance matrix .

#### III-B4 Further kernels

Additional signal reconstructors can be interpreted as kernel-based regression methods for certain choices of . Specifically, it can be seen that [23, eq. (27)] is tantamount to kernel ridge regression with kernel

provided that the adjacency matrix is properly scaled so that this inverse exists. Another example is the Tikhonov regularized estimate in [12, eq. (15)], which is recovered as kernel ridge regression upon setting

and letting tend to 0, where can be viewed as a
*high-pass filter* matrix. The role of the term is to ensure that the matrix within brackets is
invertible.

### III-C Kernel-based smoothing and graph filtering

When an observation is available per vertex for , kernel methods can still be employed for denoising purposes. Due to the regularizer in (6), the estimate will be a smoothed version of . This section shows how ridge regression smoothers can be thought of as graph filters, and vice versa. The importance of this two-way link is in establishing that kernel smoothers can be implemented in a decentralized fashion as graph filters [4].

Upon setting in (13), one recovers the ridge regression smoother . If is a Laplacian kernel, then

(24)

where .

To see how (24) relates to a graph filter, recall that the latter is an operator assigning , where [4]

(25a)

(25b)

Graph filters can be implemented in a
decentralized fashion since (25a)
involves successive products of by and these products
can be computed at each vertex by just exchanging information with
neighboring vertices. Expression
(25b) can be rewritten in the
*Fourier domain* (cf. Sec. III-B1) as upon defining and . For this reason, the diagonal of
is referred to as the *frequency response* of
the filter.

Comparing (24) with (25b) shows that can be interpreted as a graph filter with frequency response . Thus, implementing in a decentralized fashion using (25a) boils down to solving for the system of linear equations . Conversely, given a filter, a Laplacian kernel can be found so that filter and smoother coincide. To this end, assume without loss of generality that , where ; otherwise, simply scale . Then, given , the sought kernel can be constructed by setting
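As a sketch of this smoother-filter correspondence: when all vertices are observed, the ridge regression smoother with a Laplacian kernel acts in the graph Fourier domain as a filter whose response attenuates each frequency according to its penalty r(lambda). The exact scaling of `mu` below is an assumption tied to the loss normalization.

```python
import numpy as np

def ridge_smoother_as_filter(L, r, mu, y):
    """Apply the ridge-regression smoother (cf. (24)) as a graph filter:
    per-frequency response 1/(1 + mu * r(lambda)) applied to the observations y."""
    lam, U = np.linalg.eigh(L)
    response = 1.0 / (1.0 + mu * r(lam))   # frequency response of the smoother
    return U @ (response * (U.T @ y))      # filter in the graph Fourier domain
```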

## IV Multi-kernel graph signal reconstruction

One of the limitations of kernel methods is their sensitivity to the choice of the kernel. To appreciate this, Fig. 2 depicts the normalized mean-square error (NMSE) when is the square loss and across the parameter of the adopted diffusion kernel (see Sec. III-B1). The simulation setting is described in Sec. V. At this point though, it suffices to stress the impact of on the NMSE and the dependence of the optimum on the bandwidth of .

Similarly, the performance of estimators for
bandlimited signals degrades considerably if the estimator assumes a
frequency support that differs from the actual one. Even for
estimating *low-pass signals*, for which , parameter is unknown in practice. Approaches for
setting were considered
in [11, 16], but they rely solely
on and , disregarding the observations . Note that by adopting the bandlimited kernels
from Sec. III-B2, bandwidth selection boils down to kernel
selection, so both problems will be treated jointly in the
sequel through the lens of kernel-based learning.

This section advocates an MKL approach to
kernel selection in graph signal reconstruction. Two
algorithms with complementary strengths will be developed. Both select
the most suitable kernels within a user-specified *kernel
dictionary*.

### IV-A RKHS superposition

Since in (6) is determined by , kernel selection is tantamount to RKHS selection. Therefore, a kernel dictionary can be equivalently thought of as an RKHS dictionary , which motivates estimates of the form

(26)

Upon adopting a criterion that controls sparsity in this expansion, the “best” RKHSs will be selected. A reasonable approach is therefore to generalize (6) to accommodate multiple RKHSs. With selected as the square loss and , one can pursue an estimate by solving

(27)

Invoking Theorem 1 per establishes that the minimizers of (27) can be written as

(28)

for some coefficients . Substituting (28) into (27) suggests obtaining these coefficients as

(29)

where , and with . Letting , expression (29) becomes

(30)

Note that the sum in the regularizer of (30) can be interpreted as the -norm of , which is known to promote sparsity in its entries and therefore in (26). Indeed, (30) can be seen as a particular instance of group Lasso [34].

As shown next, (30) can be efficiently solved using the alternating-direction method of multipliers (ADMM) [36]. To this end, rewrite (30) by defining and , and introducing the auxiliary variable , as

(31)

ADMM iteratively minimizes the *augmented Lagrangian* of
(31) with respect to and
in a block-coordinate descent fashion, and updates the Lagrange
multipliers associated with the equality constraint using gradient
ascent (see [37] and references
therein). The resulting iteration is summarized as
Algorithm 1, where is the augmented
Lagrangian parameter, is the Lagrange multiplier associated with the
equality constraint, and

is the so-called *soft-thresholding* operator [36].
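A sketch of the (group) soft-thresholding operator appearing in the ADMM update of Algorithm 1 follows. Applied blockwise to the coefficients of each kernel, it drives entire blocks to zero and thereby effects the kernel selection promoted by the regularizer in (30); the function name and signature are illustrative.

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Shrink the block v toward the origin; return the zero vector when
    its Euclidean norm does not exceed the threshold tau."""
    nrm = np.linalg.norm(v)
    if nrm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / nrm) * v
```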

After obtaining from Algorithm 1, the wanted function estimate can be recovered as

(32)

It is recommended to normalize the kernel matrices in order to prevent imbalances in the kernel selection. Specifically, one can scale such that . If is a Laplacian kernel (see Sec. III-B1), where , one can scale to ensure .

Remark 3. Although criterion (27) is reminiscent of the MKL approach of [34], the latter differs markedly because it assumes that the right-hand side of (26) is uniquely determined given , which allows application of (6) over a direct-sum RKHS with an appropriately defined norm. However, this approach cannot be pursued here since RKHSs of graph signals frequently overlap, implying that their sum is not a direct one (cf. discussion after (5)).
