# Distributed Adaptive Learning of Graph Signals

The aim of this paper is to propose distributed strategies for adaptive learning of signals defined over graphs. Assuming the graph signal to be bandlimited, the method enables distributed reconstruction, with guaranteed performance in terms of mean-square error, and tracking from a limited number of sampled observations taken from a subset of vertices. A detailed mean square analysis is carried out and illustrates the role played by the sampling strategy on the performance of the proposed method. Finally, some useful strategies for distributed selection of the sampling set are provided. Several numerical results validate our theoretical findings, and illustrate the performance of the proposed method for distributed adaptive learning of signals defined over graphs.


## I Introduction

Over the last few years, there has been a surge of interest in the development of processing tools for the analysis of signals defined over a graph, or graph signals for short, in view of the many potential applications spanning sensor networks, social media, vehicular networks, big data, and biological networks [1, 2, 3]. Graph signal processing (GSP) considers signals defined over a discrete domain having a very general structure, represented by a graph, and subsumes classical discrete-time signal processing as a very simple case. Several processing methods for signals defined over a graph were proposed in [2], [4, 5, 6], and one of their most interesting aspects is that these analysis tools come to depend on the graph topology. A fundamental role in GSP is of course played by spectral analysis, which passes through the definition of the Graph Fourier Transform (GFT). Two main approaches for the GFT have been proposed in the literature, based on the projection of the signal onto the eigenvectors of either the graph Laplacian, see, e.g., [1], [7], [8], or the adjacency matrix, see, e.g., [2], [9]. The first approach is better suited to undirected graphs and builds on the clustering properties of the graph Laplacian eigenvectors and the minimization of the $\ell_2$-norm graph total variation; the second approach applies also to directed graphs and builds on the interpretation of the adjacency operator as a graph shift operator, which paves the way for all linear shift-invariant filtering methods for graph signals [10], [11].

One of the basic and interesting problems in GSP is the development of a sampling theory for signals defined over graphs, whose aim is to recover a bandlimited (or approximately bandlimited) graph signal from a subset of its samples. A seminal contribution was given in [7], later extended in [12] and, more recently, in [9], [13], [14], [15], [16]. Several reconstruction methods have been proposed, either iterative, as in [14], [17], or single shot, as in [9], [18], [13]. Frame-based approaches for the reconstruction of graph signals from subsets of samples have also been proposed in [7], [13], [14]. Furthermore, as shown in [9], [13], when dealing with graph signals, the recovery problem may easily become ill-conditioned, depending on the location of the samples. Thus, for any given number of samples, the sampling set plays a fundamental role in the conditioning of the recovery problem. This makes it crucial to search for strategies that optimize the selection of the sampling set over the graph. The theory developed in recent years for GSP was then applied to solve specific learning tasks, such as semi-supervised classification on graphs [19], graph dictionary learning [20], smooth graph signal recovery from random samples [21, 22, 23, 24], inpainting [25], denoising [26], and adaptive estimation [27].

Almost all previous art considers centralized processing methods for graph signals. In many practical systems, data are collected in a distributed network, and sharing local information with a central processor is either unfeasible or inefficient, owing to the large size of the network and volume of data, time-varying network topology, bandwidth/energy constraints, and/or privacy issues. Centralized processing also calls for sufficient resources to transmit the data back and forth between the nodes and the fusion center, which limits the autonomy of the network and may raise robustness concerns as well, since the central processor represents a bottleneck and a single point of failure. In addition, a centralized solution may limit the ability of the nodes to adapt in real time to time-varying scenarios. Motivated by these observations, in this paper we focus on distributed techniques for graph signal processing. Some distributed methods were recently proposed in the literature, see, e.g., [28, 29, 30]. In [28], a distributed algorithm for graph signal inpainting is proposed; the work in [29] considers distributed processing of graph signals exploiting graph spectral dictionaries; finally, reference [30] proposes a distributed tracking method for time-varying bandlimited graph signals, assuming perfect observations (i.e., no measurement noise) and a fixed sampling strategy.

Contributions of the paper: In this work, we propose distributed strategies for adaptive learning of graph signals. The main contributions are listed in the sequel.

1. We formulate the problem of distributed learning of graph signals, exploiting a probabilistic sampling scheme over the graph.

2. We provide necessary and sufficient conditions for adaptive reconstruction of the signal from the graph samples.

3. We apply diffusion adaptation methods to solve the problem of learning graph signals in a distributed manner. The resulting algorithm is a generalization of diffusion adaptation strategies where nodes sample data from the graph with some given probability.

4. We provide a detailed mean square analysis that illustrates the role of the probabilistic sampling strategy on the performance of the proposed algorithm.

5. We design useful strategies for the distributed selection of the (expected) sampling set. To the best of our knowledge, this is the first strategy available in the literature for distributed selection of graph signal samples.

The work merges, for the first time in the literature, the well-established field of adaptation and learning over networks, see, e.g., [31, 32, 33, 34, 35, 36, 37, 38, 39], with the emerging area of signal processing on graphs, see, e.g., [1, 2, 3]. The proposed method exploits the graph structure that describes the observed signal and, under a bandlimited assumption, enables adaptive reconstruction and tracking from a limited number of observations taken over a subset of vertices in a totally distributed fashion. Interestingly, the graph topology plays an important role in both the processing and communication aspects of the algorithm. A detailed mean-square analysis illustrates the role of the sampling strategy on the reconstruction capability, stability, and performance of the proposed algorithm. Based on these results, we also propose a distributed method to select the set of sampling nodes in an efficient manner. An interesting feature of the proposed strategy is that this subset is allowed to vary over time, provided that the expected sampling set satisfies specific conditions enabling signal reconstruction. We expect the proposed tools to represent a key technology for the distributed proactive sensing of cyber-physical systems, where an effective control mechanism requires the availability of data-driven sampling strategies able to monitor the overall system by only checking a limited number of nodes.

The paper is organized as follows. In Sec. II, we introduce some basic GSP tools. Sec. III introduces the proposed distributed algorithm for adaptive learning of graph signals, illustrating also the conditions enabling signal reconstruction from a subset of samples. In Sec. IV we carry out a detailed mean-square analysis, whereas Sec. V is devoted to the development of useful strategies enabling the selection of the sampling set in a totally distributed fashion. Then, in Sec. VI we report several numerical simulations, aimed at assessing the validity of the theoretical analysis and the performance of the proposed algorithm. Finally, Sec. VII draws some conclusions.

## II Graph Signal Processing Tools

In this section, we introduce some useful concepts from GSP that will be exploited throughout the paper. Let us consider a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ composed of $N$ nodes $\mathcal{V}=\{1,2,\ldots,N\}$, along with a set of weighted edges $\mathcal{E}=\{a_{ij}\}_{i,j\in\mathcal{V}}$, such that $a_{ij}>0$, if there is a link from node $j$ to node $i$, or $a_{ij}=0$, otherwise. The adjacency matrix $\mathbf{A}$ is the collection of all the weights $a_{ij}$, $i,j=1,\ldots,N$. The degree of node $i$ is $k_i=\sum_{j=1}^N a_{ij}$, and the degree matrix $\mathbf{K}$ is a diagonal matrix having the node degrees on its diagonal. The Laplacian matrix is defined as $\mathbf{L}=\mathbf{K}-\mathbf{A}$. If the graph is undirected, the Laplacian matrix is symmetric and positive semi-definite, and admits the eigendecomposition $\mathbf{L}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^H$, where $\mathbf{U}$ collects all the eigenvectors of $\mathbf{L}$ in its columns, whereas $\boldsymbol{\Lambda}$ contains the eigenvalues of $\mathbf{L}$. It is well known from spectral graph theory [40] that the eigenvectors of $\mathbf{L}$ are well suited for representing clusters, since they are the signal vectors that minimize the $\ell_2$-norm graph total variation.

A signal $\mathbf{x}$ over a graph $\mathcal{G}$ is defined as a mapping from the vertex set to the set of complex numbers, i.e., $\mathbf{x}:\mathcal{V}\rightarrow\mathbb{C}$. In many applications, the signal $\mathbf{x}$ admits a compact representation, i.e., it can be expressed as:

$$\mathbf{x}=\mathbf{U}\mathbf{s}, \qquad (1)$$

where $\mathbf{s}$ is exactly (or approximately) sparse. As an example, in all cases where the graph signal exhibits clustering features, i.e., it is a smooth function within each cluster, but it is allowed to vary arbitrarily from one cluster to the other, the representation in (1) is compact, i.e., $\mathbf{s}$ is sparse. A key example is cluster analysis in semi-supervised learning, where a constant signal (label) is associated to each cluster [41]. The GFT $\mathbf{s}$ of a signal $\mathbf{x}$ is defined as the projection onto the orthogonal set of eigenvectors of the Laplacian [1], i.e.,

$$\mathbf{s}=\mathbf{U}^H\mathbf{x}. \qquad (2)$$

The GFT has been defined in alternative ways, see, e.g., [1], [2], [8], [9]. In this paper, we basically follow the approach based on the Laplacian matrix, assuming an undirected graph structure, but the theory could be extended to handle directed graphs considering, e.g., a graph Fourier basis as proposed in [2]. Also, we denote the support of $\mathbf{s}$ in (1) as $\mathcal{F}=\{i\in\{1,\ldots,N\}:s_i\neq 0\}$, and the bandwidth of the graph signal $\mathbf{x}$ is defined as the cardinality of $\mathcal{F}$, i.e., $|\mathcal{F}|$. Clearly, combining (1) with (2), if the signal exhibits a clustering behavior, in the sense specified above, the GFT is the way to recover the sparse vector $\mathbf{s}$ in (1). Finally, given a subset of vertices $\mathcal{S}\subseteq\mathcal{V}$, we define a vertex-limiting operator as the diagonal matrix

$$\mathbf{D}_{\mathcal{S}}=\mathrm{diag}\{\mathbf{1}_{\mathcal{S}}\}, \qquad (3)$$

where $\mathbf{1}_{\mathcal{S}}$ is the set indicator vector, whose $i$-th entry is equal to one, if $i\in\mathcal{S}$, or zero otherwise.
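The objects in (1)-(3) can be illustrated with a short NumPy sketch; the path graph and the signal below are hypothetical toy choices, not taken from the paper:

```python
import numpy as np

# Toy undirected graph: a 5-node path. A is the adjacency matrix,
# K the degree matrix, and L = K - A the (combinatorial) Laplacian.
N = 5
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
K = np.diag(A.sum(axis=1))
L = K - A

# Eigendecomposition L = U diag(lam) U^H; U is orthonormal since L is symmetric.
lam, U = np.linalg.eigh(L)

# GFT (2): s = U^H x; inverse transform (1): x = U s.
x = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
s = U.conj().T @ x
x_rec = U @ s                      # perfect reconstruction

# Vertex-limiting operator (3) for a sampling set S.
S = [0, 2, 4]
D_S = np.diag(np.isin(np.arange(N), S).astype(float))
x_sampled = D_S @ x                # keeps entries on S, zeros elsewhere
```

Since the Laplacian of a connected graph always has a zero eigenvalue with a constant eigenvector, `lam[0]` is (numerically) zero here.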

## III Distributed Learning of Graph Signals

We consider the problem of learning a (possibly time-varying) graph signal from observations taken from a subset of vertices of the graph. The problem fits well, e.g., a wireless sensor network (WSN) scenario, where the nodes observe a spatial field related to some physical parameter of interest. Let us assume that the field is either fixed or slowly varying over both the time domain and the vertex (space) domain. Suppose now that the WSN is equipped with nodes that, at every time instant, can decide whether or not to take (noisy) observations of the underlying signal, depending on, e.g., energy constraints, failures, limited memory and/or processing capabilities, etc. Our purpose is to build adaptive techniques that allow the recovery of the field values at each node, a task pursued using recursive and distributed techniques as new data arrive. In this way, the information is processed on the fly by all nodes, and the data diffuse across the network by means of a real-time sharing mechanism.

Let us consider a signal $\mathbf{x}^o=\{x_i^o\}_{i=1}^N$ defined over the graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. To enable sampling of $\mathbf{x}^o$ without loss of information, the following is assumed:

Assumption 1 (Bandlimited): The signal $\mathbf{x}^o$ is $\mathcal{F}$-bandlimited on the (time-invariant) graph $\mathcal{G}$, i.e., its spectral content is different from zero only on the set of indices $\mathcal{F}$.

Under Assumption 1, if the support $\mathcal{F}$ is known beforehand, from (1), the graph signal can be cast in compact form as:

$$\mathbf{x}^o=\mathbf{U}_{\mathcal{F}}\,\mathbf{s}^o, \qquad (4)$$

where $\mathbf{U}_{\mathcal{F}}\in\mathbb{C}^{N\times|\mathcal{F}|}$ collects the subset of columns of matrix $\mathbf{U}$ in (1) associated to the frequency indices $\mathcal{F}$, and $\mathbf{s}^o\in\mathbb{C}^{|\mathcal{F}|}$ is the vector of GFT coefficients on the frequency support of the graph signal $\mathbf{x}^o$. Let us assume that streaming and noisy observations of the graph signal are sampled over a (possibly time-varying) subset of vertices. In such a case, the observation taken by node $i$ at time $n$ can be expressed as:

$$y_i[n]=d_i[n]\left(x_i^o+v_i[n]\right)=d_i[n]\left(\mathbf{c}_i^H\mathbf{s}^o+v_i[n]\right), \qquad (5)$$

for $i=1,\ldots,N$, where $(\cdot)^H$ denotes complex conjugate-transposition; $d_i[n]$ is a random sampling binary coefficient, which is equal to 1 if node $i$ takes the observation at time $n$, and 0 otherwise; $v_i[n]$ is a zero-mean, spatially and temporally independent observation noise, with variance $\sigma_i^2$; also, in (5) we have used (4), where $\mathbf{c}_i^H$ denotes the $i$-th row of matrix $\mathbf{U}_{\mathcal{F}}$. In the sequel, we assume that each node $i$ has local knowledge of its corresponding regression vector $\mathbf{c}_i$ in (5). This is a reasonable assumption even in the distributed scenario considered in this paper. For example, if neighbors in the processing graph can communicate with each other, either directly or via multi-hop routes, there exist many techniques that enable the distributed computation of eigenparameters of matrices describing sparse topologies such as the Laplacian or the adjacency, see, e.g., [42, 43, 44]. These methods are mainly based on the iterative application of distributed power iteration and consensus methods in order to iteratively compute the desired eigenparameters of the Laplacian (or adjacency) matrix, see, e.g., [44] for details. Since we consider graph signals with time-invariant topology, such procedures can be implemented offline during an initialization phase of the network to compute the required regression vectors in a totally distributed fashion. In the case of time-varying graphs, the distributed procedure should be adapted over time, but might turn out to be impractical for large dynamic networks.
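A minimal simulation of the observation model (4)-(5) can be sketched as follows; the basis $\mathbf{U}_{\mathcal{F}}$ (here a random orthonormal matrix), the sampling probabilities, and the noise variances are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: U_F (N x |F|) plays the role of the bandlimited basis
# in (4), and s_o is the true GFT vector.
N, F = 10, 3
U_F, _ = np.linalg.qr(rng.standard_normal((N, F)))
s_o = rng.standard_normal(F)
x_o = U_F @ s_o                                  # x^o = U_F s^o, eq. (4)

p = np.full(N, 0.7)                              # sampling probabilities p_i
sigma2 = np.full(N, 0.01)                        # noise variances sigma_i^2

def observe():
    """One snapshot of model (5): y_i[n] = d_i[n] (x_i^o + v_i[n])."""
    d = (rng.random(N) < p).astype(float)        # binary sampling d_i[n]
    v = np.sqrt(sigma2) * rng.standard_normal(N) # observation noise v_i[n]
    return d * (x_o + v), d

y, d = observe()
# Nodes with d_i[n] = 0 contribute no observation at this time instant.
```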

The distributed learning task consists in recovering the graph signal $\mathbf{x}^o$ from the noisy, streaming, and partial observations $y_i[n]$ in (5) by means of in-network adaptive processing. Following a least mean squares (LMS) estimation approach [31, 34, 35, 45, 36], the task can be cast as the cooperative solution of the following optimization problem:

$$\min_{\mathbf{s}}\;\sum_{i=1}^N \mathbb{E}_{d,v}\left|d_i[n]\left(y_i[n]-\mathbf{c}_i^H\mathbf{s}\right)\right|^2, \qquad (6)$$

where $\mathbb{E}_{d,v}$ denotes the expectation operator evaluated over the random variables $d_i[n]$ and $v_i[n]$, and we have exploited $d_i[n]^2=d_i[n]$ for all $i,n$. In the rest of the paper, to avoid overcrowded symbols, we will drop the subscripts in the expectation symbol referring to the random variables. In the sequel, we first analyze the conditions that enable signal recovery from a subset of samples. Then, we introduce adaptive strategies specifically tailored for the distributed reconstruction of graph signals from a limited number of samples.

### III-A Conditions for Adaptive Reconstruction of Graph Signals

In this section, we give a necessary and sufficient condition guaranteeing signal reconstruction from its samples. In particular, assuming the random sampling and observation processes $d_i[n]$ and $y_i[n]$ to be stationary, the solution of problem (6) is given by the vector $\mathbf{s}^o$ that satisfies the normal equations:

$$\left(\sum_{i=1}^N \mathbb{E}\{d_i[n]\}\,\mathbf{c}_i\mathbf{c}_i^H\right)\mathbf{s}^o=\sum_{i=1}^N \mathbb{E}\{d_i[n]\,y_i[n]\}\,\mathbf{c}_i. \qquad (7)$$

Letting $p_i=\mathbb{E}\{d_i[n]\}$, $i=1,\ldots,N$, be the probability that node $i$ takes an observation at time $n$, from (7), it is clear that reconstruction of $\mathbf{s}^o$ is possible only if the matrix

$$\sum_{i=1}^N p_i\,\mathbf{c}_i\mathbf{c}_i^H=\mathbf{U}_{\mathcal{F}}^H\mathbf{P}\,\mathbf{U}_{\mathcal{F}} \qquad (8)$$

is invertible, with $\mathbf{P}=\mathrm{diag}(p_1,\ldots,p_N)$ denoting a vertex sampling operator as in (3), but weighted by the sampling probabilities $\{p_i\}_{i=1}^N$. Let us denote the expected sampling set by

$$\overline{\mathcal{S}}=\left\{i=1,\ldots,N \;|\; p_i>0\right\}.$$

$\overline{\mathcal{S}}$ represents the set of nodes of the graph that collect data with a probability different from zero. From (7) and (8), a necessary condition enabling reconstruction is

$$|\overline{\mathcal{S}}|\geq|\mathcal{F}|, \qquad (9)$$

i.e., the number of nodes in the expected sampling set must be greater than or equal to the signal bandwidth. However, this condition is not sufficient, because the matrix in (8) may lose rank, or easily become ill-conditioned, depending on the graph topology and sampling strategy (defined by $\overline{\mathcal{S}}$ and $\{p_i\}_{i=1}^N$). To provide a condition for signal reconstruction, we proceed similarly to [16, 13, 27]. Since $p_i>0$ for all $i\in\overline{\mathcal{S}}$,

$$\mathrm{rank}\left(\sum_{i=1}^N p_i\,\mathbf{c}_i\mathbf{c}_i^H\right)=\mathrm{rank}\left(\sum_{i\in\overline{\mathcal{S}}}\mathbf{c}_i\mathbf{c}_i^H\right), \qquad (10)$$

i.e., matrix (8) is invertible if the matrix $\mathbf{U}_{\mathcal{F}}^H\mathbf{D}_{\overline{\mathcal{S}}}\mathbf{U}_{\mathcal{F}}$ has full rank, where $\mathbf{D}_{\overline{\mathcal{S}}}$ is the vertex-limiting operator that projects onto the expected sampling set $\overline{\mathcal{S}}$. Let us now introduce the operator

$$\mathbf{D}_{\overline{\mathcal{S}}^c}=\mathbf{I}-\mathbf{D}_{\overline{\mathcal{S}}}, \qquad (11)$$

which projects onto the complement of the expected sampling set, i.e., $\overline{\mathcal{S}}^c=\mathcal{V}\setminus\overline{\mathcal{S}}$. Then, exploiting (11), signal reconstruction is possible if

$$\mathbf{U}_{\mathcal{F}}^H\mathbf{D}_{\overline{\mathcal{S}}}\mathbf{U}_{\mathcal{F}}=\mathbf{I}-\mathbf{U}_{\mathcal{F}}^H\mathbf{D}_{\overline{\mathcal{S}}^c}\mathbf{U}_{\mathcal{F}}$$

is invertible, i.e., if condition

$$\left\|\mathbf{D}_{\overline{\mathcal{S}}^c}\mathbf{U}_{\mathcal{F}}\right\|_2<1 \qquad (12)$$

holds true. As shown in [16, 13], condition (12) is related to the localization properties of graph signals: it implies that there are no $\mathcal{F}$-bandlimited signals that are perfectly localized over the set $\overline{\mathcal{S}}^c$. Proceeding as in [13], [27], it is easy to show that condition (12) is necessary and sufficient for signal reconstruction. We remark that, differently from previous works on sampling of graph signals, see, e.g., [7, 12, 9, 13, 14, 15, 16], condition (12) now depends on the expected sampling set. This relaxed condition is due to the iterative nature of the adaptive learning mechanism considered in this paper. As a consequence, the algorithm does not need to collect all the data necessary to reconstruct the graph signal in one shot at each iteration, but can learn by acquiring the needed information over time. The only requirement imposed by condition (12) is that a sufficiently large number of nodes collect data in expectation (i.e., belong to the expected sampling set $\overline{\mathcal{S}}$). In the sequel, we introduce the proposed distributed algorithm.
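Condition (12) is straightforward to check numerically for a candidate expected sampling set. The sketch below uses a hypothetical random orthonormal basis in place of $\mathbf{U}_{\mathcal{F}}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bandlimited basis: random orthonormal columns (N x |F|).
N, F = 8, 2
U_F, _ = np.linalg.qr(rng.standard_normal((N, F)))

def satisfies_condition_12(expected_sampling_set):
    """Check || D_{S_bar^c} U_F ||_2 < 1, i.e., no F-bandlimited signal
    is perfectly localized on the complement of the expected sampling set."""
    mask_c = np.ones(N)
    mask_c[list(expected_sampling_set)] = 0.0    # diagonal of D_{S_bar^c}
    return np.linalg.norm(mask_c[:, None] * U_F, 2) < 1.0

print(satisfies_condition_12(range(N)))          # full sampling: True
```

With full sampling, $\mathbf{D}_{\overline{\mathcal{S}}^c}=\mathbf{0}$ and the condition trivially holds; shrinking the set below $|\mathcal{F}|$ nodes necessarily violates it, consistent with (9).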

In principle, the solution of problem (6) can be computed as the vector that satisfies the linear system in (7). Nevertheless, in many linear regression applications involving online processing of data, the moments in (7) may be either unavailable or time-varying, and thus impossible to update continuously. For this reason, adaptive solutions relying on instantaneous information are usually adopted in order to avoid the need to know the signal statistics beforehand. Furthermore, the solution of (7) would require collecting all the data $\{y_i[n]\}$, for all $i$, in a single processing unit that performs the computation. In this paper, our emphasis is on distributed, adaptive solutions, where the nodes perform the reconstruction task via online in-network processing, only exchanging data between neighbors. To this aim, diffusion techniques were proposed and studied in the literature [31, 32, 33, 46, 47]. In view of their robustness and adaptation properties, diffusion networks have been applied to solve a variety of learning tasks, such as, e.g., resource allocation problems [48], distributed optimization and learning [34], sparse distributed estimation [35, 49, 45], robust system modeling [50], and multi-task networks [37, 38, 39].

In the sequel, we provide an alternative approach to derive diffusion adaptation strategies with respect to the seminal papers [32, 31]. The derivations will be instrumental to introduce the main assumptions that will be exploited in the mean-square analysis, which will be carried out in the next section. In particular, to ensure the diffusion of information over the entire network, the following is assumed:

Assumption 2 (Topology): The communication graph is symmetric and connected; i.e., there exists an undirected path connecting any two vertices of the network.

To derive distributed solution methods for problem (6), let us introduce local copies $\{\mathbf{s}_i\}_{i=1}^N$ of the global variable $\mathbf{s}$, and recast problem (6) in the following equivalent form:

$$\min_{\{\mathbf{s}_i\}_{i=1}^N}\;\sum_{i=1}^N \mathbb{E}\left|d_i[n]\left(y_i[n]-\mathbf{c}_i^H\mathbf{s}_i\right)\right|^2 \qquad (13)$$
$$\text{subject to}\quad \mathbf{s}_i=\mathbf{s}_j\quad\text{for all } i,j=1,\ldots,N.$$

Under Assumption 2, it is possible to write the constraints in (13) in a compact manner, introducing a disagreement constraint that enforces consensus among the local variables [51]. To this aim, let us denote with $\widetilde{\mathbf{A}}=\{\tilde{a}_{ij}\}$ the adjacency matrix of the communication graph among the nodes, such that $\tilde{a}_{ij}>0$, if there is a communication link from node $j$ to node $i$, or $\tilde{a}_{ij}=0$, otherwise. Then, under Assumption 2, problem (13) [and (6)] can be rewritten in the following equivalent form:

$$\min_{\{\mathbf{s}_i\}_{i=1}^N}\;\sum_{i=1}^N \mathbb{E}\left|d_i[n]\left(y_i[n]-\mathbf{c}_i^H\mathbf{s}_i\right)\right|^2 \qquad (14)$$
$$\text{subject to}\quad \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \tilde{a}_{ij}\left\|\mathbf{s}_j-\mathbf{s}_i\right\|^2\leq 0.$$

The Lagrangian for problem (14) writes as:

$$L\left(\{\mathbf{s}_i\}_{i=1}^N,\lambda^o\right)=\sum_{i=1}^N \mathbb{E}\left|d_i[n]\left(y_i[n]-\mathbf{c}_i^H\mathbf{s}_i\right)\right|^2+\frac{\lambda^o}{2}\sum_{i=1}^N\sum_{j=1}^N \tilde{a}_{ij}\left\|\mathbf{s}_j-\mathbf{s}_i\right\|^2, \qquad (15)$$

with $\lambda^o$ denoting the (optimal) Lagrange multiplier associated with the disagreement constraint. At this stage, we do not need to worry about the selection of the Lagrange multiplier $\lambda^o$, because it will be embedded into a set of coefficients that the designer can choose. Then, we proceed by minimizing the Lagrangian function in (15) by means of a steepest descent procedure. Thus, letting $\mathbf{s}_i[n]$ be the instantaneous estimate of $\mathbf{s}_i$ at node $i$ and time $n$, we obtain (a factor of 2 multiplies the gradient of (15) when the data are real; this factor is absorbed into the step-sizes in (16)):

$$\mathbf{s}_i[n+1]=\mathbf{s}_i[n]-\mu_i\left[\nabla_{\mathbf{s}_i}L\left(\{\mathbf{s}_i[n]\}_{i=1}^N,\lambda^o\right)\right]^*$$
$$\qquad=\mathbf{s}_i[n]+\mu_i\,\mathbb{E}\left\{d_i[n]\,\mathbf{c}_i\left(y_i[n]-\mathbf{c}_i^H\mathbf{s}_i[n]\right)\right\}+\mu_i\lambda^o\sum_{j=1}^N \tilde{a}_{ij}\left(\mathbf{s}_j[n]-\mathbf{s}_i[n]\right), \qquad (16)$$

for all $i=1,\ldots,N$, where $\nabla_{\mathbf{s}_i}$ denotes the complex gradient operator, and $\mu_i>0$ are (sufficiently small) step-size coefficients. Now, using similar arguments as in [31, 34, 35, 36], we can accomplish the update (16) in two steps by generating an intermediate estimate $\boldsymbol{\psi}_i[n]$ as follows:

$$\boldsymbol{\psi}_i[n]=\mathbf{s}_i[n]+\mu_i\,\mathbb{E}\left\{d_i[n]\,\mathbf{c}_i\left(y_i[n]-\mathbf{c}_i^H\mathbf{s}_i[n]\right)\right\} \qquad (17)$$
$$\mathbf{s}_i[n+1]=\boldsymbol{\psi}_i[n]+\mu_i\lambda^o\sum_{j=1}^N \tilde{a}_{ij}\left(\boldsymbol{\psi}_j[n]-\boldsymbol{\psi}_i[n]\right) \qquad (18)$$

where in (18) we have replaced the variables $\mathbf{s}_j[n]$ with the intermediate estimates that are available at the nodes at time $n$, namely, $\boldsymbol{\psi}_j[n]$. Such substitutions are typically used to derive adaptive diffusion implementations, see, e.g., [31]. Now, from (18), introducing the coefficients

$$w_{ii}=1-\mu_i\lambda^o\sum_{j=1}^N \tilde{a}_{ij},\qquad\text{and}\qquad w_{ij}=\mu_i\lambda^o\,\tilde{a}_{ij}\;\;\text{for } i\neq j, \qquad (19)$$

we obtain

$$\mathbf{s}_i[n+1]=\sum_{j=1}^N w_{ij}\,\boldsymbol{\psi}_j[n], \qquad (20)$$

where the coefficients $\{w_{ij}\}$ are real, non-negative weights matching the communication graph and satisfying:

$$w_{ij}=0\;\;\text{for } j\notin\mathcal{N}_i,\qquad\text{and}\qquad \mathbf{W}\mathbf{1}=\mathbf{1}, \qquad (21)$$

where $\mathbf{W}\in\mathbb{R}^{N\times N}$ is the matrix with individual entries $w_{ij}$, and $\mathcal{N}_i$ is the extended neighborhood of node $i$, which comprises node $i$ itself. Recursion (17) requires knowledge of the moments $\mathbb{E}\{d_i[n]\,\mathbf{c}_i(y_i[n]-\mathbf{c}_i^H\mathbf{s}_i[n])\}$, which may be either unavailable or time-varying. An adaptive implementation can be obtained by replacing these moments with local instantaneous approximations, say, of the LMS type, i.e., $d_i[n]\,\mathbf{c}_i(y_i[n]-\mathbf{c}_i^H\mathbf{s}_i[n])$, for all $i$. The resulting algorithm is reported in Table 1, and will be termed the Adapt-Then-Combine (ATC) diffusion strategy. The first step is an adaptation step, where the intermediate estimate $\boldsymbol{\psi}_i[n]$ is updated adopting the current observation taken by node $i$, i.e., $y_i[n]$. The second step is a diffusion step, where the intermediate estimates $\boldsymbol{\psi}_j[n]$, from the spatial neighbors $j\in\mathcal{N}_i$, are combined through the weighting coefficients $\{w_{ij}\}$. Finally, given the estimate of the GFT at node $i$ and time $n$, the last step produces the estimate of the graph signal value at node $i$ [cf. (5)]. We remark that by reversing the order of steps (17) and (18), we would arrive at a similar but alternative strategy, known as the Combine-then-Adapt (CTA) diffusion strategy. We continue our discussion by focusing on the ATC algorithm; a similar analysis applies to CTA [31].
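The ATC recursion (17)-(20), with the LMS-type instantaneous approximation described above, can be sketched as follows. The ring topology, uniform combination weights, step-size, and problem sizes are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: N nodes, |F| = 3; the i-th row of a random
# orthonormal U_F plays the role of the regressor c_i^H in (5).
N, F = 10, 3
U_F, _ = np.linalg.qr(rng.standard_normal((N, F)))
s_o = rng.standard_normal(F)                 # true GFT vector s^o
p, sigma, mu = 0.6, 0.05, 0.4                # sampling prob., noise std, step-size

# Combination matrix W on a ring: each node averages itself and two
# neighbors; rows sum to one, cf. (21).
W = np.zeros((N, N))
for i in range(N):
    for j in (i - 1, i, i + 1):
        W[i, j % N] = 1.0 / 3.0

s_est = np.zeros((N, F))                     # local estimates s_i[n]
for n in range(3000):
    d = (rng.random(N) < p).astype(float)                 # sampling d_i[n]
    y = d * (U_F @ s_o + sigma * rng.standard_normal(N))  # observations (5)
    pred = np.einsum('if,if->i', U_F, s_est)              # c_i^H s_i[n]
    # Adaptation step (17), with LMS approximation of the moment:
    psi = s_est + mu * (d * (y - pred))[:, None] * U_F
    # Diffusion step (20): combine neighbors' intermediate estimates.
    s_est = W @ psi

msd = np.mean(np.sum((s_est - s_o) ** 2, axis=1))         # network MSD
```

Note that non-sampling nodes (with $d_i[n]=0$) still execute the diffusion step, so information percolates across the network even through idle vertices.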

Remark 1: The ATC strategy allows each node in the network to estimate and track the (possibly time-varying) GFT $\mathbf{s}^o$ of the graph signal $\mathbf{x}^o$. It is interesting to notice that, while sampling nodes (i.e., those with $d_i[n]=1$) take and process the observations of the graph signal, the role of the other nodes (i.e., those with $d_i[n]=0$) is only to allow the propagation of information coming from neighbors through a diffusion mechanism that percolates over the whole network. From a complexity point of view, at every iteration $n$, a node performs both the adaptation and diffusion computations if $d_i[n]=1$, and only the diffusion computations if $d_i[n]=0$. From a communication point of view, each node in the network must transmit to its neighbors a vector composed of $|\mathcal{F}|$ (complex) values at every iteration $n$.

In this work, we assume that the processing and communication graphs have, in general, distinct topologies. We remark that both graphs play an important role in the proposed distributed strategy. First, the processing graph determines the structure of the regression data used in the adaptation step: in fact, the regressors $\{\mathbf{c}_i^H\}$ are the rows of the matrix $\mathbf{U}_{\mathcal{F}}$, whose columns are the eigenvectors of the Laplacian matrix associated with the set of support frequencies $\mathcal{F}$. Then, the topology of the communication graph determines how the processed information is spread over the network through the diffusion step. This illustrates how, when reconstructing graph signals in a distributed manner, we have to take into account both the processing and communication aspects of the problem.

In the next section, we analyze the mean-square behavior of the proposed methods, enlightening the role played by the sampling strategy on the final performance.

## IV Mean-Square Performance Analysis

In this section, we analyze the performance of the ATC strategy in terms of its mean-square behavior. We remark that the analysis carried out in this section differs from classical derivations used for diffusion adaptation algorithms, see, e.g., [36]. First of all, the observation model in (5) is different from the classical models generally adopted in the adaptive filtering literature, see, e.g., [52]. Also, due to the sampling operation and the presence of deterministic regressors [cf. (5)], each node cannot reconstruct the signal using only its own data (i.e., using stand-alone LMS adaptation), and must necessarily cooperate with other nodes in order to exploit information coming from other parts of the network. These issues require the development of a dedicated (non-trivial) analysis (see, e.g., Theorem 1 and the Appendix) to prove the mean-square stability of the proposed algorithm.

From now on, we view the estimates $\mathbf{s}_i[n]$ as realizations of a random process and analyze the performance of the ATC diffusion algorithm in terms of its mean-square behavior. To do so, we introduce the error quantities

$$\mathbf{e}_i[n]=\mathbf{s}_i[n]-\mathbf{s}^o,\qquad i=1,\ldots,N, \qquad (22)$$

and the network error vector

$$\mathbf{e}[n]=\mathrm{col}\left\{\mathbf{e}_1[n],\ldots,\mathbf{e}_N[n]\right\}. \qquad (23)$$

We also introduce the block diagonal matrix

$$\mathbf{M}=\mathrm{diag}\left\{\mu_1\mathbf{I}_{|\mathcal{F}|},\ldots,\mu_N\mathbf{I}_{|\mathcal{F}|}\right\}, \qquad (24)$$

the extended block weighting matrix

$$\widehat{\mathbf{W}}=\mathbf{W}\otimes\mathbf{I}_{|\mathcal{F}|}, \qquad (25)$$

where $\otimes$ denotes the Kronecker product operation, and the extended sampling operator

$$\widehat{\mathbf{D}}[n]=\mathrm{diag}\left\{d_1[n]\mathbf{I}_{|\mathcal{F}|},\ldots,d_N[n]\mathbf{I}_{|\mathcal{F}|}\right\}. \qquad (26)$$

We further introduce the block quantities:

$$\mathbf{Q}=\mathrm{diag}\left\{\mathbf{c}_1\mathbf{c}_1^H,\ldots,\mathbf{c}_N\mathbf{c}_N^H\right\}, \qquad (27)$$
$$\mathbf{g}[n]=\mathrm{col}\left\{\mathbf{c}_1v_1[n],\ldots,\mathbf{c}_Nv_N[n]\right\}. \qquad (28)$$

Then, exploiting (23)-(28), we conclude from the ATC recursion that the following relation holds for the error vector:

$$\mathbf{e}[n+1]=\widehat{\mathbf{W}}\left(\mathbf{I}-\mathbf{M}\widehat{\mathbf{D}}[n]\mathbf{Q}\right)\mathbf{e}[n]+\widehat{\mathbf{W}}\mathbf{M}\widehat{\mathbf{D}}[n]\,\mathbf{g}[n]. \qquad (29)$$

This relation tells us how the network error vector evolves over time. As we can notice from (29), the sampling strategy affects the recursion in two ways: (a) it modifies the iteration matrix of the error recursion; (b) it selects the noise contribution only from the subset of nodes that are active at time $n$. Relation (29) will be the launching point for the mean-square analysis carried out in the sequel. Before moving forward, we introduce an independence assumption on the sampling strategy, and a small step-size assumption.

Assumption 3 (Independent sampling): The sampling process $\{d_i[n]\}$ is temporally and spatially independent, for all $i$ and $n$.

Assumption 4 (Small step-sizes): The step-sizes $\{\mu_i\}$ are sufficiently small so that terms depending on higher-order powers of $\{\mu_i\}$ can be neglected.

We now proceed by illustrating the mean-square stability and steady-state performance of the proposed ATC algorithm.

### IV-A Mean-Square Stability

We now examine the behavior of the mean-square deviation for any of the nodes in the graph. Following energy conservation arguments [31, 36], we can establish the following variance relation:

$$\mathbb{E}\left\|\mathbf{e}[n+1]\right\|_{\boldsymbol{\Sigma}}^2=\mathbb{E}\left\|\mathbf{e}[n]\right\|_{\boldsymbol{\Sigma}'}^2+\mathbb{E}\left[\mathbf{g}[n]^H\widehat{\mathbf{D}}[n]\mathbf{M}\widehat{\mathbf{W}}^T\boldsymbol{\Sigma}\widehat{\mathbf{W}}\mathbf{M}\widehat{\mathbf{D}}[n]\,\mathbf{g}[n]\right], \qquad (30)$$

where $\boldsymbol{\Sigma}$ is any Hermitian nonnegative-definite matrix that we are free to choose, and

$$\boldsymbol{\Sigma}'=\mathbb{E}\left(\mathbf{I}-\mathbf{Q}\widehat{\mathbf{D}}[n]\mathbf{M}\right)\widehat{\mathbf{W}}^T\boldsymbol{\Sigma}\widehat{\mathbf{W}}\left(\mathbf{I}-\mathbf{M}\widehat{\mathbf{D}}[n]\mathbf{Q}\right). \qquad (31)$$

Moreover, setting

$$\mathbf{G}=\mathbb{E}\left[\mathbf{g}[n]\mathbf{g}[n]^H\right]=\mathrm{diag}\left\{\sigma_1^2\,\mathbf{c}_1\mathbf{c}_1^H,\ldots,\sigma_N^2\,\mathbf{c}_N\mathbf{c}_N^H\right\}, \qquad (32)$$

we can rewrite (30) in the form

$$\mathbb{E}\left\|\mathbf{e}[n+1]\right\|_{\boldsymbol{\Sigma}}^2=\mathbb{E}\left\|\mathbf{e}[n]\right\|_{\boldsymbol{\Sigma}'}^2+\mathrm{Tr}\left(\boldsymbol{\Sigma}\widehat{\mathbf{W}}\mathbf{M}\widehat{\mathbf{P}}\mathbf{G}\mathbf{M}\widehat{\mathbf{W}}^T\right), \qquad (33)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace operator,

$$\widehat{\mathbf{P}}=\mathbb{E}\{\widehat{\mathbf{D}}[n]\}=\mathbf{P}\otimes\mathbf{I}_{|\mathcal{F}|},$$

and we have exploited the relation [cf. (26), (32), and Assumption 3]

$$\mathbb{E}\left\{\widehat{\mathbf{D}}[n]\,\mathbf{g}[n]\mathbf{g}[n]^H\widehat{\mathbf{D}}[n]\right\}=\mathbb{E}\left\{\widehat{\mathbf{D}}[n]\,\mathbf{g}[n]\mathbf{g}[n]^H\right\}=\widehat{\mathbf{P}}\mathbf{G}.$$

Let $\boldsymbol{\sigma}=\mathrm{vec}(\boldsymbol{\Sigma})$ and $\boldsymbol{\sigma}'=\mathrm{vec}(\boldsymbol{\Sigma}')$, where the notation $\mathrm{vec}(\cdot)$ stacks the columns of a matrix on top of each other and $\mathrm{vec}^{-1}(\cdot)$ is the inverse operation. We will use interchangeably the notation $\|\mathbf{e}\|_{\boldsymbol{\sigma}}^2$ and $\|\mathbf{e}\|_{\boldsymbol{\Sigma}}^2$ to denote the same quantity $\mathbf{e}^H\boldsymbol{\Sigma}\mathbf{e}$. Using the Kronecker product property $\mathrm{vec}(\mathbf{A}\boldsymbol{\Sigma}\mathbf{B})=(\mathbf{B}^T\otimes\mathbf{A})\mathrm{vec}(\boldsymbol{\Sigma})$, we can vectorize both sides of (31) and conclude that (31) can be replaced by the simpler linear vector relation $\boldsymbol{\sigma}'=\mathbf{H}\boldsymbol{\sigma}$, where $\mathbf{H}$ is the matrix:

$$\mathbf{H}=\mathbb{E}\left\{\left(\mathbf{I}-\mathbf{Q}^T\widehat{\mathbf{D}}[n]\mathbf{M}\right)\widehat{\mathbf{W}}^T\otimes\left(\mathbf{I}-\mathbf{Q}\widehat{\mathbf{D}}[n]\mathbf{M}\right)\widehat{\mathbf{W}}^T\right\}$$
$$\quad=\left(\mathbf{I}-\mathbf{I}\otimes\mathbf{Q}\widehat{\mathbf{P}}\mathbf{M}-\mathbf{Q}^T\widehat{\mathbf{P}}\mathbf{M}\otimes\mathbf{I}+\mathbb{E}\left\{\mathbf{Q}^T\widehat{\mathbf{D}}[n]\mathbf{M}\otimes\mathbf{Q}\widehat{\mathbf{D}}[n]\mathbf{M}\right\}\right)\left(\widehat{\mathbf{W}}^T\otimes\widehat{\mathbf{W}}^T\right). \qquad (34)$$

The last term in (34) can be computed in closed form. In particular, from (24), (26), and (27), it is easy to see that the term $\mathbf{Q}\widehat{\mathbf{D}}[n]\mathbf{M}$ (and $\mathbf{Q}^T\widehat{\mathbf{D}}[n]\mathbf{M}$) in (34) has a block diagonal structure, which can be recast as:

$$\mathbf{Q}\widehat{\mathbf{D}}[n]\mathbf{M}=\sum_{i=1}^N \mu_i\,d_i[n]\,\mathbf{C}_i, \qquad (35)$$

where $\mathbf{C}_i=\mathbf{E}_i\otimes\mathbf{c}_i\mathbf{c}_i^H$, and $\mathbf{E}_i=\mathbf{b}_i\mathbf{b}_i^T$, with $\mathbf{b}_i$ denoting the $i$-th canonical vector. Thus, exploiting (35), we obtain

 \mathbb{E}\{\mathbf{Q}^T\hat{\mathbf{D}}[n]\mathbf{M}\otimes \mathbf{Q}\hat{\mathbf{D}}[n]\mathbf{M}\} = \sum_{i=1}^N\sum_{j=1}^N \mu_i\mu_j\, m^{(2)}_{i,j}\, \mathbf{C}_i^T\otimes\mathbf{C}_j \qquad (36)

where, exploiting Assumption 3, we have

 m^{(2)}_{i,j} = \mathbb{E}\{d_i[n]d_j[n]\} = \begin{cases} p_i, & \text{if } i=j;\\ p_i p_j, & \text{if } i\neq j. \end{cases} \qquad (37)
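The second-order moments in (37) follow from Assumption 3 together with the binary nature of $d_i[n]$; a small Monte Carlo sketch (with illustrative sampling probabilities, not taken from the paper) confirms them:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.7, 0.5])        # illustrative sampling probabilities p_i
T = 200_000                          # number of independent realizations

d = (rng.random((T, p.size)) < p).astype(float)  # d_i[n] ~ Bernoulli(p_i), independent
m2 = d.T @ d / T                     # empirical second moments E{d_i d_j}

expected = np.outer(p, p)            # p_i * p_j for i != j ...
np.fill_diagonal(expected, p)        # ... and p_i on the diagonal, since d_i^2 = d_i
print(np.max(np.abs(m2 - expected)) < 0.01)  # True
```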

Substituting (36) into (34), we obtain the final expression:

 \mathbf{H} = \Big(\mathbf{I}\otimes\mathbf{I} - \mathbf{I}\otimes\mathbf{Q}\hat{\mathbf{P}}\mathbf{M} - \mathbf{Q}^T\hat{\mathbf{P}}\mathbf{M}\otimes\mathbf{I} + \sum_{i=1}^N\sum_{j=1}^N \mu_i\mu_j\, m^{(2)}_{i,j}\, \mathbf{C}_i^T\otimes\mathbf{C}_j\Big)\big(\hat{\mathbf{W}}^T\otimes\hat{\mathbf{W}}^T\big). \qquad (38)

Now, using the property $\mathrm{Tr}(\Sigma\mathbf{X}) = \mathrm{vec}(\mathbf{X}^T)^T\mathrm{vec}(\Sigma)$, we can rewrite (33) as follows:

 \mathbb{E}\|\mathbf{e}[n+1]\|^2_{\sigma} = \mathbb{E}\|\mathbf{e}[n]\|^2_{\mathbf{H}\sigma} + \mathrm{vec}\big(\hat{\mathbf{W}}\mathbf{M}\hat{\mathbf{P}}\mathbf{G}\mathbf{M}\hat{\mathbf{W}}^T\big)^T\sigma. \qquad (39)
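The step from (33) to (39) rests on the trace-vectorization identity $\mathrm{Tr}(\Sigma\mathbf{X}) = \mathrm{vec}(\mathbf{X}^T)^T\mathrm{vec}(\Sigma)$; a quick numerical check of this identity with random matrices (a NumPy sketch, not part of the original analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
Sigma = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))

vec = lambda M: M.flatten(order="F")   # column-stacking vec(.)
lhs = np.trace(Sigma @ X)
rhs = vec(X.T) @ vec(Sigma)
print(np.isclose(lhs, rhs))  # True
```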

The following theorem guarantees the asymptotic mean-square stability (i.e., stability in the mean and mean-square sense) of the diffusion strategy derived in Sec. III-B.

Theorem 1 (Mean-square stability): Assume that model (5), condition (12), and Assumptions 2, 3, and 4 hold. Then, for any initial condition and any choice of the combination matrix $\mathbf{W}$ satisfying (21) and being doubly stochastic, the algorithm of Sec. III-B is mean-square stable.

Proof. Let $\mathbf{r} = \mathrm{vec}\big(\hat{\mathbf{W}}\mathbf{M}\hat{\mathbf{P}}\mathbf{G}\mathbf{M}\hat{\mathbf{W}}^T\big)$. From (39), we get

 \mathbb{E}\|\mathbf{e}[n]\|^2_{\sigma} = \mathbb{E}\|\mathbf{e}[0]\|^2_{\mathbf{H}^n\sigma} + \mathbf{r}^T\sum_{l=0}^{n-1}\mathbf{H}^l\sigma \qquad (40)

where $\mathbf{e}[0]$ is the initial condition. We first note that, if $\mathbf{H}$ is stable, $\mathbf{H}^n\sigma \rightarrow 0$ as $n\rightarrow\infty$. In this way, the first term on the RHS of (40) vanishes asymptotically. At the same time, the convergence of the second term on the RHS of (40) depends only on the geometric series of matrices $\sum_{l=0}^{n-1}\mathbf{H}^l$, which is known to converge to a finite value if the matrix $\mathbf{H}$ is stable [53]. Thus, the stability of the matrix $\mathbf{H}$ is a sufficient condition for the convergence of the mean-square recursion in (40) to a steady-state value.
To verify the stability of $\mathbf{H}$, we use the following approximation, which is accurate under Assumption 4; see, e.g., [31, 34, 35]. Then, we approximate (38) as:² ²It is immediate to see that (41) can be obtained from (38) by replacing the term $\mathbb{E}\{\mathbf{Q}^T\hat{\mathbf{D}}[n]\mathbf{M}\otimes\mathbf{Q}\hat{\mathbf{D}}[n]\mathbf{M}\}$ with $\mathbf{Q}^T\hat{\mathbf{P}}\mathbf{M}\otimes\mathbf{Q}\hat{\mathbf{P}}\mathbf{M}$. This step coincides with substituting the terms $m^{(2)}_{i,i} = p_i$ in (36)-(37) with $p_i^2$, for all $i$. Such approximation appears in (41) only in the term $\mathbf{Q}^T\hat{\mathbf{P}}\mathbf{M}\otimes\mathbf{Q}\hat{\mathbf{P}}\mathbf{M}$, which is quadratic in the step-sizes and, consequently, under Assumption 4, it is assumed to produce a negligible deviation from (38).

 \mathbf{H} \approx \big(\mathbf{I}\otimes\mathbf{I} - \mathbf{I}\otimes\mathbf{Q}\hat{\mathbf{P}}\mathbf{M} - \mathbf{Q}^T\hat{\mathbf{P}}\mathbf{M}\otimes\mathbf{I} + \mathbf{Q}^T\hat{\mathbf{P}}\mathbf{M}\otimes\mathbf{Q}\hat{\mathbf{P}}\mathbf{M}\big)\big(\hat{\mathbf{W}}^T\otimes\hat{\mathbf{W}}^T\big) = \mathbf{B}^T\otimes\mathbf{B}^H \qquad (41)

with $\mathbf{B}$ given by

 \mathbf{B} = \hat{\mathbf{W}}\big(\mathbf{I}-\mathbf{M}\hat{\mathbf{P}}\mathbf{Q}\big). \qquad (42)

Thus, from (41), exploiting the properties of the Kronecker product, we deduce that the matrix $\mathbf{H}$ in (41) is stable if the matrix $\mathbf{B}$ in (42) is stable. Under the assumptions of Theorem 1, the stability of the matrix $\mathbf{B}$ in (42) is proved in the Appendix. This concludes the proof of Theorem 1.
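The Kronecker-product argument can also be checked numerically: the eigenvalues of $\mathbf{B}^T\otimes\mathbf{B}^H$ are the pairwise products of the eigenvalues of the two factors, so $\rho(\mathbf{H}) = \rho(\mathbf{B})^2 < 1$ whenever $\mathbf{B}$ is stable. Below is a sketch with an arbitrary real matrix (for which $\mathbf{B}^H = \mathbf{B}^T$), not the actual matrices of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
B = rng.standard_normal((N, N))
rho = lambda M: np.max(np.abs(np.linalg.eigvals(M)))  # spectral radius
B /= 1.1 * rho(B)                 # rescale so that rho(B) < 1 (stable B)

H = np.kron(B.T, B.conj().T)      # H = B^T (x) B^H; B is real here, so B^H = B^T
print(np.isclose(rho(H), rho(B) ** 2), rho(H) < 1.0)  # True True
```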

Remark 2: The assumptions used in Theorem 1 are sufficient conditions for graph signal reconstruction using the ATC diffusion algorithm of Sec. III-B. In particular, (12) requires that the network collects samples from a sufficiently large number of nodes on average, and guarantees the existence of a unique solution of the normal equations in (7). Furthermore, (12) and Assumption 4 are also instrumental to prove the stability of the matrix $\mathbf{B}$ in (42) [and of $\mathbf{H}$ in (41)] and, consequently, the stability in the mean and mean-square sense of the diffusion algorithm of Sec. III-B (see the Appendix).

Remark 3: In Theorem 1, we require the matrix $\mathbf{W}$ to be doubly stochastic. Note that, from the definition of the weights in (19), under Assumption 2, this further condition would imply that the step-sizes must be chosen constant for all nodes, i.e., $\mu_i = \mu$ for all $i$. However, as a consequence of Theorem 1, our strategy works for any choice of doubly stochastic matrices $\mathbf{W}$, without imposing the constraint that the step-sizes be constant across nodes. Several possible combination rules have been proposed in the literature, such as the Laplacian or the Metropolis-Hastings weights; see, e.g., [51], [54], [31].

Taking the limit as $n\rightarrow\infty$ (assuming the convergence conditions are satisfied), we deduce from (39) that:

 \lim_{n\rightarrow\infty}\mathbb{E}\|\mathbf{e}[n]\|^2_{(\mathbf{I}-\mathbf{H})\sigma} = \mathrm{vec}\big(\hat{\mathbf{W}}\mathbf{M}\hat{\mathbf{P}}\mathbf{G}\mathbf{M}\hat{\mathbf{W}}^T\big)^T\sigma. \qquad (43)

Expression (43) is a useful result: it allows us to derive several performance metrics through the proper selection of the free weighting parameter $\sigma$ (or $\Sigma$), as was done in [31]. For example, the Mean-Square Deviation (MSD) at any node $i$ is defined as the steady-state value of $\mathbb{E}|\tilde{x}_i[n]|^2$, as $n\rightarrow\infty$, where $\tilde{x}_i[n]$ denotes the reconstruction error at node $i$, for all $i$, with the estimate defined in Sec. III-B. This value can be obtained by computing $\lim_{n\rightarrow\infty}\mathbb{E}\|\mathbf{e}[n]\|^2_{\Sigma}$ with the block weighting matrix $\Sigma = \mathbf{R}_i\otimes\mathbf{c}_i\mathbf{c}_i^H$, where $\mathbf{R}_i = \mathrm{diag}(\mathbf{e}_i)$, with $\mathbf{e}_i$ denoting the $i$-th canonical vector. Then, from (43), the MSD at node $i$ can be obtained as:

 \mathrm{MSD}_i = \lim_{n\rightarrow\infty}\mathbb{E}|\tilde{x}_i[n]|^2 = \lim_{n\rightarrow\infty}\mathbb{E}\|\mathbf{e}[n]\|^2_{\mathbf{R}_i\otimes\mathbf{c}_i\mathbf{c}_i^H} = \mathrm{vec}\big(\hat{\mathbf{W}}\mathbf{M}\hat{\mathbf{P}}\mathbf{G}\mathbf{M}\hat{\mathbf{W}}^T\big)^T(\mathbf{I}-\mathbf{H})^{-1}\mathrm{vec}\big(\mathbf{R}_i\otimes\mathbf{c}_i\mathbf{c}_i^H\big). \qquad (44)

Finally, letting $\mathbf{q} = \sum_{i=1}^N \mathrm{vec}\big(\mathbf{R}_i\otimes\mathbf{c}_i\mathbf{c}_i^H\big)$ [cf. (27)], from (44), the network MSD is given by:

 \mathrm{MSD} = \lim_{n\rightarrow\infty}\mathbb{E}\|\tilde{\mathbf{x}}[n]\|^2 = \mathrm{vec}\big(\hat{\mathbf{W}}\mathbf{M}\hat{\mathbf{P}}\mathbf{G}\mathbf{M}\hat{\mathbf{W}}^T\big)^T(\mathbf{I}-\mathbf{H})^{-1}\mathbf{q}. \qquad (45)

In the sequel, we will confirm the validity of these theoretical expressions by comparing them with numerical results.
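The steady-state expressions above evaluate the limit of the geometric series in (40) through $\sum_{l\geq 0}\mathbf{H}^l = (\mathbf{I}-\mathbf{H})^{-1}$. A toy NumPy sketch (random stable matrix, with arbitrary vectors standing in for $\mathrm{vec}(\hat{\mathbf{W}}\mathbf{M}\hat{\mathbf{P}}\mathbf{G}\mathbf{M}\hat{\mathbf{W}}^T)$ and $\sigma$, none taken from the paper) illustrates the agreement between the closed form and the truncated series:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4
B = rng.standard_normal((N, N))
B /= 2.0 * np.max(np.abs(np.linalg.eigvals(B)))   # rho(B) = 0.5, so H is stable
H = np.kron(B.T, B.T)                             # small-step approximation H = B^T (x) B^T

r = rng.standard_normal(N * N)       # stands in for vec(W M P G M W^T)
sigma = rng.standard_normal(N * N)   # free weighting vector

closed_form = r @ np.linalg.solve(np.eye(N * N) - H, sigma)

# truncated geometric series of (40): r^T sum_l H^l sigma
acc, v = 0.0, sigma.copy()
for _ in range(200):
    acc += r @ v
    v = H @ v
print(np.isclose(closed_form, acc))  # True
```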

## V Distributed Graph Sampling Strategies

As illustrated in the previous sections, the properties of the proposed distributed algorithm of Sec. III-B for graph signal reconstruction strongly depend on the expected sampling set. Thus, building on the results obtained in Sec. IV, it is fundamental to devise (possibly distributed) strategies that design the expected sampling set, with the aim of reducing the computational/memory burden while still guaranteeing provable theoretical performance. To this purpose, in this section we propose a distributed method that iteratively selects vertices from the graph in order to build an expected sampling set that enables reconstruction with a limited number of nodes, while ensuring guaranteed performance of the learning task.

To select the best sampling strategy, one should optimize some (global) performance criterion, e.g., the MSD in (45), with respect to the expected sampling set or, equivalently, the weighted vertex limiting operator. However, the solution of such a problem would require global information about the entire graph to be collected at a single processing unit. To favor distributed implementations, we propose to consider a different (but related) performance metric for the selection of the sampling set, which comes directly from the solution of the normal equations in (