## I Introduction

Estimating nodal functions/signals over networks is a task emerging in various domains, such as social, brain, and power networks, to name a few. Nodal functions can represent certain attributes or classes of the nodes. On Facebook, for instance, each node represents a person, and the presence of an edge indicates that two persons are friends, while nodal attributes can be the age, gender, or movie ratings of each person. In financial networks, where each node is a company and links denote trade between two companies, the nodal function can represent the category that each company belongs to, e.g., technology-, fashion-, or education-related.

In real-world networks, nodal function values are often unavailable, due to, e.g., privacy issues. Hence, a topic of great practical importance is to interpolate missing nodal values (class, ranking, or function), based on the function values at a subset of observed nodes. Interpolation of nodal function values often relies on the assumption of "smoothness" over the graph, which implies that neighboring nodes have similar nodal function values. For example, in social networks, people tend to rate movies similarly to their friends, and in financial networks, companies that trade with each other usually belong to the same category. From this point of view, function estimation over graphs based on partial observations has been investigated extensively [1, 2, 3, 4, 5, 6]. Function estimation has also been pursued in the context of semi-supervised learning, e.g., for transductive regression or classification; see e.g., [7, 8, 9, 10]. The same task has been studied recently as signal reconstruction over graphs, see e.g., [11, 12, 13, 14, 15], where signal values on unobserved nodes can be estimated by properly introducing a graph-aware prior. Kernel-based methods for learning over graphs offer a unifying framework that includes linear and nonlinear function estimators [13, 16, 17]. The nonlinear methods outperform the linear ones, but suffer from the curse of dimensionality [18], rendering them less attractive for large-scale networks. To alleviate this limitation, the present paper introduces a scalable kernel-based approach, which leverages the random feature approximation to ensure *scalability*, while also allowing *real-time* evaluation of the functions over large-scale dynamic networks. In addition, the novel approach incorporates a data-driven scheme for *adaptive* kernel selection.

Adaptive learning over graphs has also been investigated for tracking and learning over possibly dynamic networks, e.g., [19, 17]. Least mean-squares and recursive least-squares adaptive schemes have been developed in [19], without explicitly accounting for evolving network topologies. In contrast, [17] proposed a kernel-based reconstruction scheme to track time-varying signals over time-evolving topologies, but assumed that the kernel function is selected a priori. All these prior works assume that the network size is fixed.

In certain applications however, new nodes may join the network over time. For example, hundreds of new users join Facebook or Netflix every day, and new companies are founded in financial networks regularly. Real-time and scalable estimation of the desired functions on these newly-joining nodes is of great importance. While simple schemes such as averaging over one- or multi-hop neighborhoods are scalable to the network size, since they predict the value on each newly-joining node as a weighted combination of its multi-hop neighborhoods [20], they do not capture global information over the network. In addition, existing rigorous approaches are in general less efficient in accounting for newly-joining nodes, and need to solve the problem over all nodes every time new nodes join the network, which incurs complexity $\mathcal{O}(N^3)$, where $N$ denotes the network size [13, 16]. As a result, these methods are not amenable to real-time evaluation over newly-joining nodes. To this end, the present paper develops a scalable *online graph-adaptive* algorithm that can efficiently estimate nodal functions on newly-joining nodes 'on the fly.'

Besides scalability and adaptivity, nodes may have firm *privacy* requirements, and may therefore not be willing to reveal who their neighbors are. However, most graph-based learning methods require knowing the entire connectivity pattern, and thus cannot meet the privacy requirements. The novel random feature based approach on the other hand, only requires an encrypted version of each node’s connectivity pattern, which makes it appealing for networks with stringent privacy constraints.

In short, we put forth a novel online multikernel learning (MKL) framework for effectively learning and tracking nonlinear functions over graphs. Our contributions are as follows.

c1) A scalable MKL approach is developed to efficiently estimate the nodal function values both on the observed and unobserved nodes of a graph;

c2) The resultant algorithm is capable of estimating the function value of newly incoming nodes with high accuracy without having to solve the batch problem over all nodes, making it highly scalable as the network size grows, and suitable for nodal function estimation in dynamic networks;

c3) Unlike most existing methods that rely on nodal feature vectors in order to learn the function, the proposed scheme simply capitalizes on the connectivity pattern of each node, while at the same time, nodal feature vectors can be easily incorporated if available; and,

c4) The proposed algorithm does not require nodes to share connectivity patterns. Instead, a privacy-preserving scheme is developed for estimating the nodal function values based on an encrypted version of the nodal connectivity patterns, hence respecting node privacy.

The rest of this paper is organized as follows. Preliminaries are in Section II, while Section III presents an online kernel-based algorithm that allows sequential processing of nodal samples. Section IV develops an online MKL scheme for sequential data-driven kernel selection, which allows graph-adaptive selection of kernel functions to best fit the learning task of interest. Performance of the proposed approach is analyzed in Section V. Finally, results of corroborating numerical tests on both synthetic and real data are presented in Section VI, while concluding remarks along with a discussion of ongoing and future directions are given in Section VII.

Notation. Bold uppercase (lowercase) letters denote matrices (column vectors), while $(\cdot)^\top$ and $\lambda_i(\cdot)$ stand for matrix transposition and the $i$th leading eigenvalue of the matrix argument, respectively. The identity matrix will be represented by $\mathbf{I}$, while $\mathbf{0}$ will denote the matrix of all zeros, and their dimensions will be clear from the context. Finally, the $\ell_2$ and Frobenius norms will be denoted by $\|\cdot\|_2$ and $\|\cdot\|_F$, respectively.

## II Kernel-based learning over graphs

Consider a graph of $N$ nodes, whose topology is captured by a known adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$. Let $a_{ij}$ denote the $(i,j)$th entry of $\mathbf{A}$, which is nonzero only if an edge is present from node $i$ to node $j$. A real-valued function (or signal) on a graph is a mapping $f: \mathcal{V} \rightarrow \mathbb{R}$, where $\mathcal{V} := \{v_1, \ldots, v_N\}$ is the set of vertices. The value $f(v)$ represents an attribute of $v \in \mathcal{V}$; e.g., in the context of brain networks, $f(v_i)$ could represent the sample of an electroencephalogram (EEG), or functional magnetic resonance imaging (fMRI) measurement at region $i$. In a social network, $f(v_i)$ could denote the age, political alignment, or annual income of the $i$th person. Suppose that a collection of noisy samples $\{y_m = f(v_{n_m}) + e_m\}_{m=1}^{M}$ is available, where $e_m$ models noise, and $M \le N$ represents the number of measurements. Given $\{y_m\}_{m=1}^{M}$, and with the graph topology known, the goal is to estimate $f(\cdot)$, and thus reconstruct the graph signal at unobserved vertices. Letting $\mathbf{y} := [y_1, \ldots, y_M]^\top$, the observation vector obeys

$$ \mathbf{y} = \boldsymbol{\Psi} \mathbf{f} + \mathbf{e} \qquad (1) $$

where $\mathbf{f} := [f(v_1), \ldots, f(v_N)]^\top$, $\mathbf{e} := [e_1, \ldots, e_M]^\top$, and $\boldsymbol{\Psi} \in \{0, 1\}^{M \times N}$ is a sampling matrix with binary entries $[\boldsymbol{\Psi}]_{m, n_m} = 1$ for $m = 1, \ldots, M$, and $0$ elsewhere.
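To make the sampling model concrete, the following sketch builds a toy instance of (1); the network size, number of samples, and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 4                        # network size and number of sampled nodes
f = rng.standard_normal(N)          # graph signal f := [f(v_1), ..., f(v_N)]^T

# Sampling matrix Psi: row m has a single 1 at the index of the m-th sampled node.
sampled = rng.choice(N, size=M, replace=False)
Psi = np.zeros((M, N))
Psi[np.arange(M), sampled] = 1.0

e = 0.01 * rng.standard_normal(M)   # measurement noise
y = Psi @ f + e                     # observation model (1)
```

Each row of `Psi` simply picks out one nodal value, so `Psi @ f` equals `f[sampled]`.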

Given $\mathbf{y}$, $\boldsymbol{\Psi}$, and $\mathbf{A}$, the goal is to estimate $\mathbf{f}$ over the entire network. To tackle the under-determined system (1), consider functions $f$ belonging to a reproducing kernel Hilbert space (RKHS) defined as [16, 13]

$$ \mathcal{H} := \Big\{ f : f(v) = \sum_{n=1}^{N} \alpha_n \kappa(v, v_n), \ \alpha_n \in \mathbb{R} \Big\} \qquad (2) $$

where $\kappa: \mathcal{V} \times \mathcal{V} \rightarrow \mathbb{R}$ is a pre-selected kernel function. Hereafter, we will let $n_m = m$ for notational convenience, and without loss of generality (wlog); that is, the sampled nodes are re-indexed as the first $M$ nodes. Given $\mathbf{y}$, the RKHS-based estimate is formed as

$$ \hat{f} = \arg\min_{f \in \mathcal{H}} \; \frac{1}{M} \sum_{m=1}^{M} \mathcal{C}\big(f(v_m), y_m\big) + \mu\, \Omega\big(\|f\|_{\mathcal{H}}^2\big) \qquad (3) $$

where the cost $\mathcal{C}(\cdot, \cdot)$ can be selected depending on the learning task, e.g., the least-squares (LS) cost for regression, or the logistic loss for classification; $\|f\|_{\mathcal{H}}$ is the RKHS norm; $\Omega(\cdot)$ is an increasing function; and $\mu > 0$ is a regularization parameter that copes with overfitting. According to the definition of the graph RKHS in (2), the function estimate can be written as $\hat{\mathbf{f}} = \mathbf{K}\boldsymbol{\alpha}$, where $\boldsymbol{\alpha} := [\alpha_1, \ldots, \alpha_N]^\top$ collects the basis coefficients, and $\mathbf{K} \in \mathbb{R}^{N \times N}$ is the kernel matrix. Substituting into the RKHS norm, we find $\|f\|_{\mathcal{H}}^2 = \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$, where the kernel matrix has entries $[\mathbf{K}]_{n, n'} := \kappa(v_n, v_{n'})$; thus, the functional problem (3) boils down to

$$ \min_{\boldsymbol{\alpha} \in \mathbb{R}^{N}} \; \frac{1}{M} \sum_{m=1}^{M} \mathcal{C}\big([\boldsymbol{\Psi} \mathbf{K} \boldsymbol{\alpha}]_m, y_m\big) + \mu\, \Omega\big(\boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}\big). \qquad (4) $$

According to the representer theorem, the optimal solution of (3) admits the finite-dimensional form given by [16, 13]

$$ \hat{f}(v) = \sum_{m=1}^{M} \bar{\alpha}_m \kappa(v, v_m) = \bar{\boldsymbol{\alpha}}^\top \bar{\mathbf{k}}(v) \qquad (5) $$

where $\bar{\boldsymbol{\alpha}} := [\bar{\alpha}_1, \ldots, \bar{\alpha}_M]^\top$, and $\bar{\mathbf{k}}(v) := [\kappa(v, v_1), \ldots, \kappa(v, v_M)]^\top$. This means that the coefficients corresponding to the unobserved nodes are all zero, which implies that the function over the graph can be estimated by optimizing over the $M \times 1$ vector $\bar{\boldsymbol{\alpha}}$ [cf. (3)]

$$ \min_{\bar{\boldsymbol{\alpha}} \in \mathbb{R}^{M}} \; \frac{1}{M} \sum_{m=1}^{M} \mathcal{C}\big([\bar{\mathbf{K}} \bar{\boldsymbol{\alpha}}]_m, y_m\big) + \mu\, \Omega\big(\bar{\boldsymbol{\alpha}}^\top \bar{\mathbf{K}} \bar{\boldsymbol{\alpha}}\big) \qquad (6) $$

where $\bar{\mathbf{K}} := \boldsymbol{\Psi} \mathbf{K} \boldsymbol{\Psi}^\top$. For general kernel-based learning tasks, $\mathbf{K}$ is formed using nonlinear functions of pairwise correlations $\kappa(v_n, v_{n'}) = \kappa(\mathbf{x}_n, \mathbf{x}_{n'})$, where $\mathbf{x}_n$ denotes the feature vector of node $n$, which can collect, for example, the buying history of users on Amazon, or the trading history of companies in financial networks. However, such information may not be available in practice, due to, e.g., privacy concerns. This has motivated the graph-kernel based approaches in [13] and [16], which reconstruct the graph signal when only the network structure is available, and the kernel matrix is selected as a nonlinear function of the graph Laplacian matrix. Specifically, these works mostly consider undirected networks, for which $\mathbf{A} = \mathbf{A}^\top$.
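For the LS cost with $\Omega(x) = x$, problem (6) admits a closed-form minimizer. The following is a minimal sketch under those assumptions; the function name and the default regularization value are illustrative, not from the paper.

```python
import numpy as np

def kernel_ridge_alpha(K_bar, y, mu=0.1):
    """LS instance of (6): min_a (1/M)||y - K_bar a||^2 + mu a^T K_bar a.
    Zeroing the gradient yields a = (K_bar + mu * M * I)^{-1} y."""
    M = len(y)
    return np.linalg.solve(K_bar + mu * M * np.eye(M), y)
```

Note that the solver works with the $M \times M$ matrix $\bar{\mathbf{K}}$ only, but forming $\bar{\mathbf{K}}$ still presupposes the full kernel matrix $\mathbf{K}$.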

Given the normalized Laplacian matrix $\mathbf{L} := \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$, with $\mathbf{D}$ denoting the diagonal degree matrix, and letting $\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$ denote its eigendecomposition, the family of graphical kernels is

$$ \mathbf{K} := r^{\dagger}(\mathbf{L}) = \mathbf{U}\, r^{\dagger}(\boldsymbol{\Lambda})\, \mathbf{U}^\top \qquad (7) $$

where $r(\cdot)$ is a non-decreasing scalar function applied to the eigenvalues on the diagonal of $\boldsymbol{\Lambda}$, and $\dagger$ denotes pseudo-inverse. By selecting $r(\cdot)$, different graph properties can be accounted for, including smoothness, band-limitedness, the random walk [16], and diffusion [2].
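As a concrete instance of (7), the sketch below builds the diffusion kernel, for which $r(\lambda) = \exp(\sigma^2 \lambda / 2)$, via an eigendecomposition of the normalized Laplacian; the function name and the $\sigma^2$ default are illustrative assumptions.

```python
import numpy as np

def diffusion_kernel(A, sigma2=1.0):
    """Graphical kernel (7) with r(lambda) = exp(sigma2 * lambda / 2),
    built from the normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_isqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    L = np.eye(A.shape[0]) - d_isqrt[:, None] * A * d_isqrt[None, :]
    lam, U = np.linalg.eigh(L)           # O(N^3) eigendecomposition
    r = np.exp(sigma2 * lam / 2.0)       # r is strictly positive here, so r^† = 1/r
    return (U / r) @ U.T                 # K = U diag(1/r(lam)) U^T
```

The `eigh` call is exactly the $\mathcal{O}(N^3)$ step discussed next, which motivates the RF-based alternative.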

Although graph-kernel based methods are effective in reconstructing signals over graphs, it can be observed from (7) that formulating $\mathbf{K}$ generally requires an eigenvalue decomposition of $\mathbf{L}$, which incurs complexity $\mathcal{O}(N^3)$ that can be prohibitive for large-scale networks. Moreover, even though nodal feature vectors are not necessary to form $\mathbf{K}$, the graph-kernel based scheme requires knowledge of the topology, meaning $\mathbf{A}$, in order to estimate the nodal function at each node. However, in networks with strict privacy requirements, nodes may not be willing to share such information with others. In Facebook, for example, most people do not make their friend list public. In addition, solving (4) assumes that all sampled nodes are available in batch, which may not be true in scenarios where nodes are sampled in a sequential fashion.

In response to these challenges, an online scalable kernel-based method will be developed in the ensuing section to deal with sequentially obtained data samples, over generally dynamic networks. The resultant algorithm only requires encrypted versions of the nodal connectivity patterns of other nodes, and hence it offers privacy.

## III Online kernel-based learning over graphs

Instead of resorting to a graph kernel that requires an eigenvalue decomposition of $\mathbf{L}$ in (7), the present section advocates treating the *connectivity pattern of each node as its feature vector*, which can be the $n$th column, and possibly also the $n$th row, of the adjacency matrix (if $\mathbf{A}$ is nonsymmetric). We will henceforth term this the *connectivity pattern* of $v_n$, and denote it as $\mathbf{a}_n$ for brevity. Given $\{\mathbf{a}_n\}$, we will interpolate unavailable nodal function values using a nonparametric approach that is different from, and scalable relative to, [16] and [13]. The kernel matrix is now

$$ [\mathbf{K}]_{n, n'} := \kappa(\mathbf{a}_n, \mathbf{a}_{n'}) \qquad (8) $$

Again, with $M$ nodes sampled, the representer theorem asserts that the sought function estimator has the form [18]

$$ \hat{f}(\mathbf{a}) = \sum_{m=1}^{M} \alpha_m \kappa(\mathbf{a}_m, \mathbf{a}) \qquad (9) $$

where $\mathbf{a}$ denotes a generic connectivity pattern. It can be observed from (9) that evaluating $\hat{f}$ involves the adjacency of the entire network, namely $\{\mathbf{a}_m\}_{m=1}^{M}$, which leads to a complexity that potentially grows as the number of sampled nodes increases [18].

### III-A Batch RF-based learning over graphs

To bypass this growing complexity, we will resort to the so-called random feature approximation [21] in order to reduce the original functional learning task in (4) to a problem whose number of unknown parameters does not grow with $M$. We first approximate $\kappa$ in (5) using random features (RFs) [21, 22] that are obtained from a shift-invariant kernel satisfying $\kappa(\mathbf{a}_n, \mathbf{a}_{n'}) = \kappa(\mathbf{a}_n - \mathbf{a}_{n'})$. For $\kappa(\mathbf{a}_n - \mathbf{a}_{n'})$ absolutely integrable, its Fourier transform $\pi_{\kappa}(\mathbf{v})$ exists and represents the power spectral density, which, upon normalizing to ensure $\kappa(\mathbf{0}) = 1$, can also be viewed as a probability density function (pdf); hence,

$$ \kappa(\mathbf{a}_n - \mathbf{a}_{n'}) = \int \pi_{\kappa}(\mathbf{v})\, e^{j \mathbf{v}^\top (\mathbf{a}_n - \mathbf{a}_{n'})}\, d\mathbf{v} = \mathbb{E}_{\mathbf{v}}\big[ e^{j \mathbf{v}^\top (\mathbf{a}_n - \mathbf{a}_{n'})} \big] \qquad (10) $$

where the last equality is due to the definition of the expected value. Drawing a sufficient number of independent and identically distributed samples $\{\mathbf{v}_i\}_{i=1}^{D}$ from $\pi_{\kappa}(\mathbf{v})$, the ensemble mean in (10) can be approximated by the sample average

$$ \hat{\kappa}(\mathbf{a}_n, \mathbf{a}_{n'}) = \mathbf{z}^\top(\mathbf{a}_n)\, \mathbf{z}(\mathbf{a}_{n'}) \qquad (11) $$

where $\{\mathbf{v}_i\}_{i=1}^{D}$ are drawn i.i.d. from $\pi_{\kappa}(\mathbf{v})$, and $\mathbf{z}(\mathbf{a}) \in \mathbb{R}^{2D}$ denotes the *real-valued* RF vector

$$ \mathbf{z}(\mathbf{a}) := D^{-1/2} \big[ \sin(\mathbf{v}_1^\top \mathbf{a}), \ldots, \sin(\mathbf{v}_D^\top \mathbf{a}), \cos(\mathbf{v}_1^\top \mathbf{a}), \ldots, \cos(\mathbf{v}_D^\top \mathbf{a}) \big]^\top. \qquad (12) $$

Taking expectations in (11) and using (10), one can verify that $\mathbb{E}[\hat{\kappa}(\mathbf{a}_n, \mathbf{a}_{n'})] = \kappa(\mathbf{a}_n, \mathbf{a}_{n'})$, which means $\hat{\kappa}$ is unbiased. Note that finding $\pi_{\kappa}(\mathbf{v})$ requires an $N$-dimensional Fourier transform of $\kappa$, which in general requires numerical integration. Nevertheless, it has been shown that for a number of popular kernels, $\pi_{\kappa}(\mathbf{v})$ is available in closed form [21]. Taking the Gaussian kernel as an example, where $\kappa(\mathbf{a}_n - \mathbf{a}_{n'}) = \exp\big(-\|\mathbf{a}_n - \mathbf{a}_{n'}\|_2^2 / (2\sigma^2)\big)$, its Fourier transform corresponds to the pdf $\pi_{\kappa}(\mathbf{v}) = \mathcal{N}(\mathbf{0}, \sigma^{-2} \mathbf{I})$.
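The approximation (10)-(12) can be checked numerically. The sketch below approximates a Gaussian kernel with the real-valued RF map of (12); the input dimension, number of features, bandwidth, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D = 5, 2000                     # input dimension and number of random features
sigma2 = 1.0                       # Gaussian kernel bandwidth sigma^2

# Draw v_i i.i.d. from the spectral pdf: N(0, sigma^{-2} I) for the Gaussian kernel.
V = rng.standard_normal((D, d)) / np.sqrt(sigma2)

def z(x):
    """Real-valued RF map (12): z(x) in R^{2D}."""
    return np.concatenate([np.sin(V @ x), np.cos(V @ x)]) / np.sqrt(D)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
approx = z(x1) @ z(x2)                                          # sample average (11)
exact = np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma2))    # kappa(x1 - x2)
```

The inner product `z(x1) @ z(x2)` concentrates around the exact kernel value at rate $\mathcal{O}(1/\sqrt{D})$, and $\mathbf{z}^\top(\mathbf{x})\mathbf{z}(\mathbf{x}) = 1$ exactly, since $\sin^2 + \cos^2 = 1$.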

Hence, the function that is optimal in the sense of (3) can be cast as a function approximant over the $2D$-dimensional RF space (cf. (9) and (11))

$$ \hat{f}^{\text{RF}}(\mathbf{a}) = \sum_{m=1}^{M} \alpha_m \mathbf{z}^\top(\mathbf{a}_m)\, \mathbf{z}(\mathbf{a}) = \boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}) \qquad (13) $$

where $\boldsymbol{\theta} := \sum_{m=1}^{M} \alpha_m \mathbf{z}(\mathbf{a}_m)$. While $\hat{f}$ in (5) is the superposition of nonlinear functions $\{\kappa(\cdot, \mathbf{a}_m)\}$, its RF approximant in (13) is a linear function of $\mathbf{z}(\mathbf{a})$. As a result, (3) reduces to

$$ \min_{\boldsymbol{\theta} \in \mathbb{R}^{2D}} \; \frac{1}{M} \sum_{m=1}^{M} \mathcal{C}\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_m), y_m\big) + \mu\, \Omega\big(\|\boldsymbol{\theta}\|^2\big) \qquad (14) $$

where $\|\boldsymbol{\theta}\|^2$ replaces the RKHS norm of the approximant. A batch solver of (14) has complexity that does not grow with $N$: it scales linearly with the number of measured nodes $M$, and the number of variables is $2D$, which does not depend on $N$. This allows us to pursue an online implementation, as elaborated next.
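For the LS cost with $\Omega(x) = x$, the batch RF problem (14) also has a closed form; a minimal sketch under those assumptions follows (function name and default $\mu$ are illustrative). The key point is that the linear system is of fixed size $2D$, regardless of $N$.

```python
import numpy as np

def batch_rf_solver(Z, y, mu=0.1):
    """LS instance of (14): min_theta (1/M)||y - Z theta||^2 + mu ||theta||^2,
    where row m of Z is the RF vector z(a_m) in R^{2D}.  The closed form
    theta = (Z^T Z + mu * M * I)^{-1} Z^T y costs O(M D^2 + D^3), independent of N."""
    M, dim = Z.shape
    return np.linalg.solve(Z.T @ Z + mu * M * np.eye(dim), Z.T @ y)
```

Predictions on any node then cost a single inner product, `Z_new @ theta`.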

### Iii-B Online RF-based learning over graphs

Here, we will further leverage RF-based learning over graphs to enable real-time learning and reconstruction of signals evolving over possibly dynamic networks. A scalable online algorithm will be introduced, which can adaptively handle sequentially sampled nodal features and update the sought function estimates.

Training sequentially. In the training phase, we are given a network of $N$ nodes, and the nodal function is sampled in a sequential fashion. Letting $v_t$ denote the node sampled at the $t$th time slot, and having the sample $y_t$ available at $v_t$, the online inference task can be written as [cf. (14)]

$$ \min_{\boldsymbol{\theta} \in \mathbb{R}^{2D}} \; \sum_{t=1}^{T} \mathcal{L}\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big), \quad \text{with } \mathcal{L}\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big) := \mathcal{C}\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t), y_t\big) + \mu\, \Omega\big(\|\boldsymbol{\theta}\|^2\big) \qquad (15) $$

We will solve (15) using online gradient descent [23]. Upon obtaining the sample $y_t$ per slot $t$, the RF vector $\mathbf{z}(\mathbf{a}_t)$ of its connectivity pattern is formed as in (12), and $\boldsymbol{\theta}_t$ is updated 'on the fly' as

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\, \nabla \mathcal{L}\big(\boldsymbol{\theta}_t^\top \mathbf{z}(\mathbf{a}_t)\big) \qquad (16) $$

where $\{\eta_t\}$ is the sequence of stepsizes that can tune learning rates. In this paper, we will adopt a constant stepsize $\eta_t = \eta$ for simplicity. Iteration (16) provides *a functional update*, since $\hat{f}_t(\mathbf{a}) = \boldsymbol{\theta}_t^\top \mathbf{z}(\mathbf{a})$. The per-iteration complexity of (16) is $\mathcal{O}(D)$, and hence $\mathcal{O}(TD)$ for the entire training process of $T$ slots, which scales better than the complexity required by a batch solver of (14).

Inferring unavailable nodal values. After the training phase, the nodal function values over the unsampled nodes can be readily estimated by [cf. (13)]

$$ \hat{f}(\mathbf{a}_n) = \boldsymbol{\theta}_T^\top \mathbf{z}(\mathbf{a}_n), \qquad n \in \mathcal{S}^c \qquad (17) $$

where $\boldsymbol{\theta}_T$ is the final estimate after the training phase, i.e., the output of (16) at the last training slot, and $\mathcal{S}^c$ denotes the index set of the nodes whose signal values have not been sampled in the training phase.

Newly-joining nodes. When new nodes join the network, batch graph-kernel based approaches must expand $\mathbf{K}$ in (7) by one row and one column, and re-solve (6) in order to form signal estimates for the newly-joining nodes. Hence, each newly joining node will incur complexity $\mathcal{O}(N^3)$. The novel online RF method, on the other hand, can simply estimate the signal on the newly joining node via $\hat{f}(\mathbf{a}_{\mathrm{new}}) = \boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_{\mathrm{new}})$, where $\mathbf{a}_{\mathrm{new}}$ denotes the connectivity pattern of the new node with the *existing* nodes in the network. This leads to a complexity of $\mathcal{O}(D)$ per new node. If, in addition, the sample $y_{\mathrm{new}}$ is available, then the function estimate can also be efficiently updated via (16) and (13) using $\mathbf{z}(\mathbf{a}_{\mathrm{new}})$ and $y_{\mathrm{new}}$.
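The new-node prediction step can be sketched as follows; the sizes, the randomly filled `theta` standing in for a trained parameter vector, and the sparsity of the new connectivity pattern are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 100
V = rng.standard_normal((D, N))       # spectral samples, fixed at training time
theta = rng.standard_normal(2 * D)    # stand-in for a trained RF parameter vector

def z(a):
    """RF map (12) of a connectivity pattern a."""
    return np.concatenate([np.sin(V @ a), np.cos(V @ a)]) / np.sqrt(D)

# Connectivity of a newly joining node with the N existing nodes:
a_new = (rng.random(N) < 0.1).astype(float)
f_hat = theta @ z(a_new)              # one inner product; no batch re-solve over all nodes
```

No kernel matrix is expanded and no batch problem is re-solved; only the new node's links to existing nodes are needed.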

The steps of our online RF-based method are summarized in Algorithm 1. A couple of standard learning tasks where Algorithm 1 comes handy are now in order.

*Nonlinear regression over graphs.*
Consider first nonlinear regression over graphs, where the goal is to find a nonlinear function $f$ such that $y_n \approx f(\mathbf{a}_n)$, given the graph adjacency matrix $\mathbf{A}$. The criterion is to minimize the regularized prediction error, typically using the online LS loss $\mathcal{L}\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big) = \big[y_t - \boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big]^2 + \mu \|\boldsymbol{\theta}\|^2$ in (15), whose gradient is (cf. (21)) $\nabla \mathcal{L} = -2\big[y_t - \boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big]\mathbf{z}(\mathbf{a}_t) + 2\mu\boldsymbol{\theta}$.

In practice, $y_n$ can represent a noisy version of each node's real-valued attribute, e.g., the temperature in a certain city, and the graph can be constructed based on Euclidean distances among cities. For a fair comparison with alternatives, only the regression task will be tested in the numerical section of this paper.

*Nonlinear classification over graphs.*

We can also handle the kernel-based perceptron and kernel-based logistic regression, which aim at learning a nonlinear classifier that best approximates either the label $y_n$ itself, or the pdf of $y_n$ conditioned on $\mathbf{a}_n$. With binary labels $y_n \in \{-1, 1\}$, the perceptron solves (3) with the loss $\mathcal{C}\big(f(\mathbf{a}_n), y_n\big) = \max\big(0, -y_n f(\mathbf{a}_n)\big)$, which equals zero if $y_n = \mathrm{sgn}\big(f(\mathbf{a}_n)\big)$, and otherwise equals $-y_n f(\mathbf{a}_n)$. In this case, the gradient of the presented online RF-based method is (cf. (21)) $\nabla \mathcal{L} = -y_t \mathbf{z}(\mathbf{a}_t)\,\mathbb{1}\big\{y_t\, \boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t) < 0\big\} + 2\mu\boldsymbol{\theta}$. Accordingly, given $\mathbf{a}_t$, logistic regression postulates that $\Pr(y_t\,|\,\mathbf{a}_t) = 1\big/\big(1 + \exp\big(-y_t\, \boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big)\big)$. Here the gradient takes the form (cf. (21)) $\nabla \mathcal{L} = -y_t \mathbf{z}(\mathbf{a}_t)\big/\big(1 + \exp\big(y_t\, \boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big)\big) + 2\mu\boldsymbol{\theta}$.

Classification over graphs arises in various scenarios, where $y_n$ may represent categorical attributes such as the gender, occupation, or nationality of users in a social network.

Remark 1 (Privacy). Note that the update in (16) does not require access to $\mathbf{a}_t$ directly. Instead, the only information each node needs to reveal is $\mathbf{z}(\mathbf{a}_t)$, which involves the terms $\sin(\mathbf{v}_i^\top \mathbf{a}_t)$ and $\cos(\mathbf{v}_i^\top \mathbf{a}_t)$. Being noninvertible, these (co)sinusoidal functions involved in generating $\mathbf{z}(\mathbf{a}_t)$ can be viewed as an encryption of the nodal connectivity pattern, which means that, given $\mathbf{z}(\mathbf{a}_t)$, the vector $\mathbf{a}_t$ cannot be uniquely deciphered. Hence, Algorithm 1 preserves privacy.

Remark 2 (Directed graphs). It can be observed from (7) that, for $\mathbf{K}$ to be a valid kernel, graph-kernel based methods require $\mathbf{L}$, and hence $\mathbf{A}$, to be symmetric, which implies they can only directly deal with symmetric/undirected graphs. Such a requirement is not necessary for our RF-based method.

Remark 3 (Dynamic graphs). Real-world networks may vary over time, as edges may disappear or appear. To cope with such changing topologies, the original graph-kernel method needs to recalculate the kernel matrix, and re-solve the batch problem whenever a single edge changes. In contrast, our online RF-based method can simply re-estimate the nodal values on the two ends of the (dis)appeared edge using (13) with their current connectivity patterns.

Evidently, the performance of Algorithm 1 depends on the kernel $\kappa$, which has so far been considered known. As the "best" performing $\kappa$ is generally unknown and application dependent, it is prudent to adaptively select kernels by superimposing multiple kernels from a prescribed dictionary, as we elaborate next.

## IV Online Graph-adaptive MKL

In the present section, we develop an online graph-adaptive learning approach that relies on random features, and leverages multi-kernel approximation to estimate the desired function $f$ based on sequentially obtained nodal samples over the graph. The proposed method is henceforth abbreviated as Gradraker.

The choice of the kernel $\kappa$ is critical for the performance of single-kernel based learning over graphs, since different kernels capture different properties of the graph, and thus lead to function estimates of variable accuracy [13]. To deal with this, combinations of kernels from a preselected dictionary $\{\kappa_p\}_{p=1}^{P}$ can be employed in (3); see also [13, 22]. Each combination belongs to the convex hull $\bar{\mathcal{K}} := \{ \bar{\kappa} = \sum_{p=1}^{P} \bar{w}_p \kappa_p : \bar{w}_p \ge 0, \sum_{p=1}^{P} \bar{w}_p = 1 \}$. With $\bar{\mathcal{H}}$ denoting the RKHS induced by $\bar{\kappa} \in \bar{\mathcal{K}}$, one then solves (3) with $\mathcal{H}$ replaced by $\bar{\mathcal{H}} := \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_P$, where $\mathcal{H}_1, \ldots, \mathcal{H}_P$ represent the RKHSs corresponding to $\kappa_1, \ldots, \kappa_P$ [24].

The candidate function is then expressible in a separable form as $\bar{f}(\mathbf{a}) = \sum_{p=1}^{P} \bar{f}_p(\mathbf{a})$, where $\bar{f}_p$ belongs to $\mathcal{H}_p$, for $p = 1, \ldots, P$. To add flexibility per kernel in our ensuing online MKL scheme, we let wlog $\bar{f}_p := \bar{w}_p \hat{f}_p$, and seek functions of the form

$$ \hat{f}(\mathbf{a}) = \sum_{p=1}^{P} \bar{w}_p \hat{f}_p(\mathbf{a}) \qquad (18) $$

where $\hat{f}_p \in \mathcal{H}_p$, and the normalized weights satisfy $\bar{w}_p \ge 0$ and $\sum_{p=1}^{P} \bar{w}_p = 1$. Exploiting separability jointly with the RF-based function approximation per kernel, the MKL task can be reformulated, after letting $\hat{f}_p(\mathbf{a}) = \boldsymbol{\theta}_p^\top \mathbf{z}_p(\mathbf{a})$ in (15), as

$$ \min_{\{\bar{w}_p\}, \{\boldsymbol{\theta}_p\}} \; \sum_{t=1}^{T} \sum_{p=1}^{P} \bar{w}_p\, \mathcal{L}\big(\boldsymbol{\theta}_p^\top \mathbf{z}_p(\mathbf{a}_t)\big) \qquad (19a) $$

$$ \text{s.t.} \quad \sum_{p=1}^{P} \bar{w}_p = 1 \qquad (19b) $$

$$ \bar{w}_p \ge 0, \quad p = 1, \ldots, P \qquad (19c) $$

which can be solved efficiently 'on the fly.' Relative to (14), we replaced the sample index $m$ by the time index $t$ to introduce the notion of time, and to stress the fact that the nodes are sampled sequentially.

Given the connectivity pattern $\mathbf{a}_t$ of the $t$th sampled node, an RF vector $\mathbf{z}_p(\mathbf{a}_t)$ is generated per kernel $\kappa_p$ from the pdf $\pi_{\kappa_p}(\mathbf{v})$ via (12), where $\mathbf{z}_{p,t} := \mathbf{z}_p(\mathbf{a}_t)$ for notational brevity. Hence, per kernel $p$ and node sample $t$, we have [cf. (13)]

$$ \hat{f}_{p,t}(\mathbf{a}_t) = \boldsymbol{\theta}_{p,t}^\top \mathbf{z}_{p,t} \qquad (20) $$

and, as in (16), $\boldsymbol{\theta}_{p,t}$ is updated via

$$ \boldsymbol{\theta}_{p,t+1} = \boldsymbol{\theta}_{p,t} - \eta\, \nabla \mathcal{L}\big(\boldsymbol{\theta}_{p,t}^\top \mathbf{z}_{p,t}\big) \qquad (21) $$

with the stepsize $\eta$ chosen constant to effect the adaptation. As far as $\bar{w}_p$ is concerned, since it resides on the probability simplex, a multiplicative update is well motivated, as discussed also in, e.g., [23, 22]. For the un-normalized weights $\{w_{p,t}\}$, this update is available in closed form as [22]

$$ w_{p, t+1} = w_{p, t}\, \exp\Big( -\eta\, \mathcal{L}\big(\boldsymbol{\theta}_{p,t}^\top \mathbf{z}_{p,t}\big) \Big) \qquad (22) $$

Having found $\{w_{p, t+1}\}$ as in (22), the normalized weights in (18) are obtained as $\bar{w}_{p, t+1} = w_{p, t+1} \big/ \sum_{p'=1}^{P} w_{p', t+1}$. Note from (22) that when $\hat{f}_{p,t}$ has a larger loss relative to the other approximants for the $t$th sampled node, the corresponding $w_{p, t+1}$ decreases more than the other weights. In other words, a more accurate approximant tends to play a more important role in predicting the ensuing sampled node. In summary, our Gradraker for online graph MKL is listed as Algorithm 2.
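The multiplicative update (22), followed by normalization, can be sketched compactly; the function name, learning rate, and the toy per-kernel losses are illustrative.

```python
import numpy as np

def update_weights(w, losses, eta=0.5):
    """Multiplicative update (22): each un-normalized weight shrinks exponentially
    with its kernel's instantaneous loss; normalizing yields the simplex weights (18)."""
    w = w * np.exp(-eta * np.asarray(losses))
    return w, w / w.sum()
```

Because the decrease is exponential in the loss, kernels that fit the sampled node poorly are down-weighted fastest, while the weights always remain on the probability simplex.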

Remark 4 (Comparison with batch MKL). A batch MKL based approach for signal reconstruction over graphs was developed in [13]. It entails an iterative algorithm whose complexity grows with in order to jointly estimate the nodal function, and to adaptively select the kernel function. When new nodes join the network, [13] re-calculates the graphical kernels and re-solves the overall batch problem, which does not scale with the network size. In addition, [13] is not privacy preserving in the sense that in order to estimate the function at any node, one needs to have access to the connectivity pattern of the entire network.

Remark 5 (Comparison with $k$-NN). An intuitive yet efficient way to predict the function value of a newly joining node is to simply combine the values of its $k$ nearest neighbors ($k$-NN) [20, 25]. Efficient as it is, $k$-NN faces several challenges: a) at least one of the neighbors must be labeled, which does not always hold in practice, and is not required by the Gradraker; and b) $k$-NN can only account for local information, while the Gradraker also takes into account the global information of the graph.

### IV-A Generalizations

So far, it has been assumed that each node only has available its own connectivity feature vector $\mathbf{a}_n$. This allows Gradraker to be applied even when limited information is available about the nodes, a setting that many existing algorithms relying on nodal features cannot directly cope with.

If additional feature vectors are actually available per node other than its own $\mathbf{a}_n$, it is often not known a priori which set of features is the most informative for estimating the signal of interest on the graph. To this end, the novel Gradraker can be adapted by treating the functions learned from different sets of features as *an ensemble of learners*, and combining them in a similar fashion as in (18), that is,

$$ \hat{f}(v_n) = \sum_{l=1}^{L} \bar{w}_l \hat{f}_l\big(\mathbf{x}_n^{(l)}\big) \qquad (23) $$

where $\mathbf{x}_n^{(l)}$ denotes the $l$th feature set available at node $v_n$, and $\hat{f}_l$ the corresponding learner.

Applications to several practical scenarios are discussed in the following.

Semi-private networks. In practice, a node may tolerate sharing its links with its neighbors; e.g., users of Facebook may share their friends-list with friends. In this scenario, each node not only knows its own neighbors, but also has access to who its neighbors' neighbors are, i.e., its two-hop neighbors. Specifically, node $n$ has access to $\mathbf{a}_n$, as well as to the $n$th column of $\mathbf{A}^2$ [1], and a corresponding learner can henceforth be introduced and combined in (23). Moreover, when nodes are less strict about privacy, e.g., when a node is willing to share its multi-hop neighbors, more learners can be introduced and combined 'on the fly' by selecting the $l$th feature set as the $n$th column of $\mathbf{A}^{l}$ in (23).

Multilayer networks.
Despite their popularity, ordinary networks are often inadequate to describe increasingly complex systems. For instance, modeling interactions between two individuals using a single edge can be a gross simplification of reality. Generalizing their *single-layer* counterparts, *multilayer networks* allow nodes to belong to groups, called layers [26, 27]. These layers could represent different attributes or characteristics of a complex system, such as temporal snapshots of the same network, or different types of groups in social networks (family, soccer club, or work related). Furthermore, multilayer networks are able to model systems that typically cannot be represented by traditional graphs, such as heterogeneous information networks [28, 29].
To this end, Gradraker can readily incorporate the information collected from heterogeneous sources, e.g., connectivity patterns from different layers, by adopting a kernel-based learner on the $l$th layer, and combining them as in (23).

Nodal features available. In certain cases, nodes may have nodal features [1] in addition to their connectivity patterns $\mathbf{a}_n$. For example, in social networks, other than the users' connectivity patterns, we may also have access to their shopping history on Amazon. In financial networks, in addition to the knowledge of trade relationships with other companies, there may be additional information available per company, e.g., the number of employees, the category of products the company sells, or the annual profit. Gradraker can also incorporate this information by introducing additional learners based on the nodal feature vectors, and combining them as in (23).

## V Performance analysis

To analyze the performance of the novel Gradraker algorithm, we assume that the following are satisfied.

(as1)
*For all sampled nodes, the loss function $\mathcal{L}$ in (15) is convex w.r.t. $\boldsymbol{\theta}$.*

(as2)
*For $\boldsymbol{\theta}$ belonging to a bounded set $\Theta$ with $\|\boldsymbol{\theta}\| \le C_{\theta}$, the loss is bounded, that is, $\mathcal{L}\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big) \in [-1, 1]$, and has bounded gradient, meaning $\big\| \nabla \mathcal{L}\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big) \big\| \le L$.*

(as3)
*The kernels $\{\kappa_p\}$ are shift-invariant, standardized, and bounded, that is, $\kappa_p(\mathbf{a}_n, \mathbf{a}_{n'}) \le 1$ for all $\mathbf{a}_n, \mathbf{a}_{n'}$; and w.l.o.g. the connectivity patterns also have bounded entries, meaning $\|\mathbf{a}_n\| \le 1$.*

Convexity of the loss under (as1) is satisfied by popular loss functions, including the square loss and the logistic loss. As far as (as2) is concerned, it ensures that the losses and their gradients are bounded, meaning they are $L$-Lipschitz continuous. While boundedness of the losses commonly holds since $\boldsymbol{\theta}$ is bounded, Lipschitz continuity is also not restrictive. Considering kernel-based regression as an example, the gradient is $\nabla \mathcal{L}\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t)\big) = 2\big(\boldsymbol{\theta}^\top \mathbf{z}(\mathbf{a}_t) - y_t\big)\mathbf{z}(\mathbf{a}_t) + 2\mu\boldsymbol{\theta}$. Since the loss is bounded, and the RF vector in (12) satisfies $\|\mathbf{z}(\mathbf{a}_t)\| \le 1$, a bounded Lipschitz constant $L$ follows from the Cauchy-Schwarz inequality. Kernels satisfying the conditions in (as3) include Gaussian, Laplacian, and Cauchy [21]. In general, (as1)-(as3) are standard in online convex optimization (OCO) [30, 23], and in kernel-based learning [31, 21, 32].

In order to quantify the performance of Gradraker, we resort to the static regret metric, which quantifies the difference between the aggregate loss of an OCO algorithm, and that of the best fixed function approximant in hindsight; see also e.g., [30, 23]. Specifically, for a sequence $\{\hat{f}_t\}_{t=1}^{T}$ obtained by an online algorithm $\mathcal{A}$, its static regret is

$$ \mathrm{Reg}_{\mathcal{A}}^{s}(T) := \sum_{t=1}^{T} \mathcal{L}\big(\hat{f}_t(\mathbf{a}_t)\big) - \sum_{t=1}^{T} \mathcal{L}\big(f^{*}(\mathbf{a}_t)\big) \qquad (24) $$

where $\mathcal{L}\big(f(\mathbf{a}_t), y_t\big)$ will henceforth be replaced by $\mathcal{L}\big(f(\mathbf{a}_t)\big)$ for notational brevity; and $f^{*}(\cdot)$ is defined as the batch solution

$$ f^{*} := \arg\min_{f \in \mathcal{F}} \; \sum_{t=1}^{T} \mathcal{L}\big(f(\mathbf{a}_t)\big) \qquad (25) $$

where $\mathcal{F} := \bigcup_{p=1}^{P} \mathcal{H}_p$, with $\mathcal{H}_p$ representing the RKHS induced by $\kappa_p$. We establish the regret of our Gradraker approach in the following lemma.

###### Lemma 1

Under (as1) and (as2), with $\hat{f}_p^{*}(\mathbf{a}) := \boldsymbol{\theta}_p^{*\top} \mathbf{z}_p(\mathbf{a})$ denoting the best fixed function approximant in the RF space of $\kappa_p$, and with a constant stepsize $\eta$, for any $p \in \{1, \ldots, P\}$, the sequences $\{\hat{f}_{p,t}\}$ and $\{\bar{w}_{p,t}\}$ generated by Gradraker satisfy the following bound

$$ \sum_{t=1}^{T} \mathcal{L}\big(\hat{f}_t(\mathbf{a}_t)\big) - \sum_{t=1}^{T} \mathcal{L}\big(\hat{f}_p^{*}(\mathbf{a}_t)\big) \le \frac{\ln P}{\eta} + \frac{\|\boldsymbol{\theta}_p^{*}\|^2}{2\eta} + \frac{\eta L^2 T}{2} + \eta T \qquad (26) $$

where $\boldsymbol{\theta}_p^{*}$ is the parameter vector associated with the best RF function approximant $\hat{f}_p^{*}$.

Proof: See Appendix A

In addition to bounding the regret in the RF space, the next theorem compares the Gradraker loss relative to that of the best functional estimator in the original RKHS.

###### Theorem 2

Proof: See Appendix B

Observe that the probability that the bound in Theorem 2 holds grows as $D$ increases, and one can always find a $D$ that ensures a positive probability for a given accuracy. Theorem 2 establishes that, with a proper choice of parameters, the Gradraker achieves sub-linear regret relative to the best static function approximant in (25), which means the novel Gradraker algorithm is capable of capturing the nonlinear relationship among nodal functions accurately, as long as enough nodes are sampled sequentially.

In addition, it is worth noting that Theorem 2 holds true regardless of the sampling order of the nodes. However, optimizing over the sampling pattern is possible, and constitutes one of our future research directions.

## VI Numerical tests

In this section, Gradraker is tested on both synthetic and real datasets to corroborate its effectiveness. The tests will mainly focus on regression tasks for a fair comparison with existing alternatives.

### VI-A Synthetic data test

Data generation. An Erdős-Rényi graph [33] with binary adjacency matrix $\mathbf{A}$ was generated with a prescribed probability of edge presence, and its adjacency matrix was symmetrized. This symmetrization is not required by Gradraker, but it is necessary for the alternative graph-kernel based methods. A function over this graph was then generated, with each entry of the coefficient vector drawn uniformly at random, and each entry of the noise drawn from a zero-mean Gaussian distribution. In each experiment, the sampling matrix $\boldsymbol{\Psi}$ is randomly generated so that a subset of the nodes is randomly sampled, and the remaining nodes are treated as newly-joining nodes, whose function values and connectivity patterns are both unknown during the training phase, and whose nodal function values are estimated based on their connectivity with existing nodes in the network during the testing phase. All algorithms are carried out on the training set of nodes, and the obtained model is used to estimate the function values on the newly arriving nodes. The runtime for estimating the function value on the newly-joining nodes, as well as the generalization