1 Introduction
This chapter puts forth an optimization framework for learning over networks, which entails decentralized processing of training data acquired by interconnected nodes. Such an approach is of paramount importance when communication of training data to a central processing unit is prohibited due to, e.g., communication cost or privacy concerns. The so-termed in-network processing paradigm for decentralized learning is based on successive refinements of local model parameter estimates maintained at individual network nodes. In a nutshell, each iteration of this broad class of fully decentralized algorithms comprises: (i) a communication step where nodes exchange information with their neighbors through, e.g., the shared wireless medium or the Internet backbone; and (ii) an update step where each node uses this information to refine its local estimate. Devoid of hierarchy and relying solely on decentralized in-network processing, the local estimators should eventually consent to the global estimator sought, while fully exploiting existing spatiotemporal correlations to maximize estimation performance. In most cases, consensus can formally be attained only asymptotically in time. However, a finite number of iterations suffices to obtain results that are accurate enough for all practical purposes.
In this context, the approach followed here entails reformulating a generic learning task as a convex constrained optimization problem, whose structure lends itself naturally to decentralized implementation over a network graph. It is then possible to capitalize on this favorable structure by resorting to the alternating-direction method of multipliers (ADMM), an iterative optimization method that can be traced back to Glowinski_Marrocco_ADMM_1975 (see also Gabay_Mercier_ADMM_1976), and which is especially well-suited for parallel processing bertsi97book; Boyd_ADMM. This way, simple decentralized recursions become available to update each node's local estimate, as well as a vector of dual prices through which network-wide agreement is effected.
Problem statement. Consider a network of $J$ nodes in which scarcity of power and bandwidth resources encourages only single-hop inter-node communications, such that the $j$th node communicates solely with nodes in its single-hop neighborhood $\mathcal{N}_j$. Inter-node links are assumed symmetric, and the network is modeled as an undirected graph whose vertices are the nodes and whose edges represent the available communication links. As will become clear through the different application domains studied here, nodes could be wireless sensors, wireless access points (APs), electrical buses, sensing cognitive radios, or routers, to name a few examples. Node $j$ acquires measurements stacked in the vector $\mathbf{y}_j$ containing information about the unknown model parameters in $\mathbf{s} \in \mathbb{R}^p$, which the nodes need to estimate. Let $\mathbf{y} := [\mathbf{y}_1^\top, \ldots, \mathbf{y}_J^\top]^\top$ collect the measurements acquired across the entire network. Many popular centralized schemes obtain an estimate $\hat{\mathbf{s}}$ as follows
\[ \hat{\mathbf{s}} \in \arg\min_{\mathbf{s}} \sum_{j=1}^{J} f_j(\mathbf{s}; \mathbf{y}_j). \tag{1} \]
In the decentralized learning problem studied here though, the summands $f_j(\mathbf{s}; \mathbf{y}_j)$ are local cost functions known only to node $j$. Sharing this information with a centralized processor, also referred to as a fusion center (FC), can be challenging in various applications of interest, or may even be impossible in, e.g., wireless sensor networks (WSNs) operating under stringent power budget constraints. In other settings, such as the Internet or collaborative healthcare studies, agents may not be willing to share their private training data but only the learning results. Performing the optimization (1) in a centralized fashion raises robustness concerns as well, since the central processor represents an isolated point of failure.
In this context, the objective of this chapter is to develop a decentralized algorithmic framework for learning tasks, based on in-network processing of the locally available data. The described setup naturally suggests three characteristics that the algorithms should exhibit: c1) each node $j$ should obtain an estimate of $\mathbf{s}$ that coincides with the corresponding solution of the centralized estimator (1) using the entire data $\mathbf{y}$; c2) processing per node should be kept as simple as possible; and c3) the overhead for inter-node communications should be affordable and confined to single-hop neighborhoods. It will be argued that such an ADMM-based algorithmic framework can be useful for contemporary applications in the domain of wireless communications and networking.
Prior art. Existing decentralized solvers of (1) can be classified in two categories: C1) those obtained by modifying centralized algorithms and operating in the primal domain; and C2) those handling an equivalent constrained form of (1) (see (2) in Section 2), and operating in the primal-dual domain. Primal-domain algorithms under C1 include the (sub)gradient method and its variants Nedic2009; Ram2010; Yuan2013; Jakovetic2013, the incremental gradient method Rabbat2005inc, the proximal gradient method Chen2012, and the dual averaging method Duchi2012; Tsianos2012acc. In these methods, each node averages its local iterate with those of its neighbors and descends along its local negative (sub)gradient direction. However, the resultant algorithms are limited to inexact convergence when constant stepsizes are used Nedic2009; Yuan2013. If diminishing stepsizes are employed instead, the algorithms can achieve exact convergence at the price of slower convergence Jakovetic2013; Rabbat2005inc; Duchi2012. A constant-stepsize exact first-order algorithm is also available that achieves fast and exact convergence by correcting error terms in the distributed gradient iteration with two-step historic information Shi2014extra.
Primal-dual domain algorithms under C2 solve an equivalent constrained form of (1), and thus drive local solutions to reach global optimality. The dual decomposition method is applicable here because (sub)gradients of the dual function depend on local and neighboring iterates only, and can thus be computed without global cooperation Rabbat2005. ADMM modifies dual decomposition by regularizing the constraints with a quadratic term, which improves numerical stability as well as the rate of convergence, as will be demonstrated later in this chapter. Per ADMM iteration, each node solves a subproblem that can be demanding. Fortunately, these subproblems can be solved inexactly by running one-step gradient or proximal gradient descent iterations, which markedly mitigates the computational burden Ling2014icassp; Chang2014icassp. A sequential distributed ADMM algorithm can be found in Wei2012.
Chapter outline. The remainder of this chapter is organized as follows. Section 2 describes a generic ADMM framework for decentralized learning over networks, which is at the heart of all algorithms described in the chapter and was pioneered in sg06asilomar; srg08tsp for in-network estimation using WSNs. Section 3 focuses on batch estimation as well as (un)supervised inference, while Section 4 deals with decentralized adaptive estimation and tracking schemes where network nodes collect data sequentially in time. Internet traffic anomaly detection and spectrum cartography for wireless cognitive radio (CR) networks serve as motivating applications for the sparsity-regularized rank minimization algorithms developed in Section 5. Fundamental results on the convergence and convergence rate of decentralized ADMM are stated in Section 6.
2 In-Network Learning with ADMM in a Nutshell
Since the local summands in (1) are coupled through the global variable $\mathbf{s}$, it is not straightforward to decompose the unconstrained optimization problem in (1). To overcome this hurdle, the key idea is to introduce local variables $\{\mathbf{x}_j\}_{j=1}^{J}$, where $\mathbf{x}_j$ represents a local estimate of $\mathbf{s}$ at network node $j$ sg06asilomar; srg08tsp. Accordingly, one can formulate the constrained minimization problem
\[ \{\hat{\mathbf{x}}_j\}_{j=1}^{J} = \arg\min_{\{\mathbf{x}_j\}} \sum_{j=1}^{J} f_j(\mathbf{x}_j; \mathbf{y}_j), \quad \text{s. to } \mathbf{x}_j = \mathbf{x}_{j'}, \; j' \in \mathcal{N}_j, \; j = 1, \ldots, J. \tag{2} \]
The “consensus” equality constraints in (2) ensure that local estimates coincide within neighborhoods. Further, if the graph is connected, then consensus naturally extends to the whole network, and it turns out that problems (1) and (2) are equivalent in the sense that $\hat{\mathbf{x}}_j = \hat{\mathbf{s}}$ for all $j$ srg08tsp. Interestingly, the formulation in (2) exhibits a separable structure that is amenable to decentralized minimization. To leverage this favorable structure, the alternating-direction method of multipliers (ADMM), see e.g., (bertsi97book, pp. 253–261), can be employed here to minimize (2) in a decentralized fashion. This procedure will yield a distributed estimation algorithm whereby the local iterates $\mathbf{x}_j(k)$, with $k$ denoting iterations, provably converge to the centralized estimate $\hat{\mathbf{s}}$ in (1); see also Section 6.
To facilitate the application of ADMM, consider the auxiliary variables $\{\mathbf{z}_j^{j'}\}$, one per link, and reparameterize the constraints in (2) with the equivalent ones
\[ \text{s. to } \mathbf{x}_j = \mathbf{z}_j^{j'}, \quad \mathbf{z}_j^{j'} = \mathbf{x}_{j'}, \quad j' \in \mathcal{N}_j, \; j = 1, \ldots, J. \tag{3} \]
The variables $\{\mathbf{z}_j^{j'}\}$ are only used to derive the local recursions and will eventually be eliminated. Attaching Lagrange multipliers $\{\bar{\mathbf{v}}_j^{j'}\}$ and $\{\tilde{\mathbf{v}}_j^{j'}\}$ to the constraints in (3), consider the augmented Lagrangian function
\[ \mathcal{L}_c\big[\{\mathbf{x}_j\}, \{\mathbf{z}_j^{j'}\}, \{\mathbf{v}\}\big] = \sum_{j=1}^{J} f_j(\mathbf{x}_j; \mathbf{y}_j) + \sum_{j=1}^{J} \sum_{j' \in \mathcal{N}_j} \Big[ (\bar{\mathbf{v}}_j^{j'})^\top (\mathbf{x}_j - \mathbf{z}_j^{j'}) + (\tilde{\mathbf{v}}_j^{j'})^\top (\mathbf{z}_j^{j'} - \mathbf{x}_{j'}) \Big] + \frac{c}{2} \sum_{j=1}^{J} \sum_{j' \in \mathcal{N}_j} \Big[ \|\mathbf{x}_j - \mathbf{z}_j^{j'}\|^2 + \|\mathbf{z}_j^{j'} - \mathbf{x}_{j'}\|^2 \Big] \tag{4} \]
where the constant $c > 0$ is a penalty coefficient. To minimize (2), ADMM entails an iterative procedure comprising three steps per iteration $k$:
[S1] Multiplier updates:
\[ \bar{\mathbf{v}}_j^{j'}(k) = \bar{\mathbf{v}}_j^{j'}(k-1) + c\big[\mathbf{x}_j(k) - \mathbf{z}_j^{j'}(k)\big], \qquad \tilde{\mathbf{v}}_j^{j'}(k) = \tilde{\mathbf{v}}_j^{j'}(k-1) + c\big[\mathbf{z}_j^{j'}(k) - \mathbf{x}_{j'}(k)\big]. \]
[S2] Local estimate updates:
\[ \{\mathbf{x}_j(k+1)\} = \arg\min_{\{\mathbf{x}_j\}} \mathcal{L}_c\big[\{\mathbf{x}_j\}, \{\mathbf{z}_j^{j'}(k)\}, \{\mathbf{v}(k)\}\big]. \]
[S3] Auxiliary variable updates:
\[ \{\mathbf{z}_j^{j'}(k+1)\} = \arg\min_{\{\mathbf{z}_j^{j'}\}} \mathcal{L}_c\big[\{\mathbf{x}_j(k+1)\}, \{\mathbf{z}_j^{j'}\}, \{\mathbf{v}(k)\}\big]. \]
where $j = 1, \ldots, J$ and $j' \in \mathcal{N}_j$ in [S1]. Reformulating the generic learning problem (1) as (2) renders the augmented Lagrangian in (4) highly decomposable. The separability comes in two flavors: with respect to the sets $\{\mathbf{x}_j\}$ and $\{\mathbf{z}_j^{j'}\}$ of primal variables, as well as across nodes $j$. This in turn leads to highly parallelized, simplified recursions corresponding to the aforementioned steps [S1]–[S3]. Specifically, as detailed in, e.g., srg08tsp; sgrr08tsp; SMG_D_LMS; pfacgg10jmlr; mateos_dlasso; mmg13tsp, it follows that if the multipliers are initialized to zero, the ADMM-based decentralized algorithm reduces to the following updates carried out locally at every node.
In-network learning algorithm at node $j$, for $k = 1, 2, \ldots$:
\[ \mathbf{v}_j(k) = \mathbf{v}_j(k-1) + c \sum_{j' \in \mathcal{N}_j} \big[\mathbf{x}_j(k) - \mathbf{x}_{j'}(k)\big] \tag{5} \]
\[ \mathbf{x}_j(k+1) = \arg\min_{\mathbf{x}} \Big\{ f_j(\mathbf{x}; \mathbf{y}_j) + \mathbf{v}_j^\top(k)\, \mathbf{x} + c \sum_{j' \in \mathcal{N}_j} \Big\| \mathbf{x} - \frac{\mathbf{x}_j(k) + \mathbf{x}_{j'}(k)}{2} \Big\|^2 \Big\} \tag{6} \]
where $\mathbf{v}_j$ denotes the (scaled) sum of the multipliers associated with node $j$, and all initial values are set to zero.
Recursions (5) and (6) entail only local updates and comprise the general-purpose ADMM-based decentralized learning algorithm. The inherently redundant set of auxiliary variables $\{\mathbf{z}_j^{j'}\}$ and the corresponding multipliers have been eliminated. Each node, say the $j$th one, does not need to separately keep track of all its non-redundant multipliers, but only to update their (scaled) sum $\mathbf{v}_j$. In the end, node $j$ has to store and update only two $p$-dimensional vectors, namely $\mathbf{x}_j$ and $\mathbf{v}_j$. A unique feature of in-network processing is that nodes communicate their updated local estimates $\mathbf{x}_j$ (and not their raw data $\mathbf{y}_j$) with their neighbors, in order to carry out the tasks (5)–(6) for the next iteration.
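To make the recursions concrete, the sketch below instantiates (5)–(6) for quadratic local costs $f_j(\mathbf{x}; \mathbf{y}_j) = \frac{1}{2}\|\mathbf{y}_j - \mathbf{H}_j\mathbf{x}\|^2$, for which the minimization in (6) is available in closed form. The network topology, data model, and parameter values below are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def innetwork_admm(H, y, neighbors, c=1.0, iters=3000):
    """Decentralized consensus-ADMM updates (5)-(6) for the
    quadratic local costs f_j(x) = 0.5*||y_j - H_j x||^2."""
    J, p = len(H), H[0].shape[1]
    x = [np.zeros(p) for _ in range(J)]   # local estimates x_j
    v = [np.zeros(p) for _ in range(J)]   # (scaled) multiplier sums v_j
    for _ in range(iters):
        # (5): multiplier update using the current local estimates
        for j in range(J):
            v[j] = v[j] + c * sum(x[j] - x[jp] for jp in neighbors[j])
        # (6): closed-form minimizer of the local quadratic subproblem
        x_new = []
        for j in range(J):
            A = H[j].T @ H[j] + 2 * c * len(neighbors[j]) * np.eye(p)
            b = H[j].T @ y[j] - v[j] + c * sum(x[j] + x[jp] for jp in neighbors[j])
            x_new.append(np.linalg.solve(A, b))
        x = x_new
    return x

# Toy problem: 4 nodes on a ring, each holding 3 local measurements of s in R^2.
rng = np.random.default_rng(0)
s_true = np.array([1.0, -2.0])
H = [rng.standard_normal((3, 2)) for _ in range(4)]
y = [Hj @ s_true + 0.01 * rng.standard_normal(3) for Hj in H]
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

x = innetwork_admm(H, y, neighbors)
# Centralized least-squares benchmark, cf. (1)
s_hat = np.linalg.solve(sum(Hj.T @ Hj for Hj in H),
                        sum(Hj.T @ yj for Hj, yj in zip(H, y)))
```

Note that each node touches only its own data and its neighbors' iterates, and consensus emerges without any fusion center.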
3 Batch In-Network Estimation and Inference
3.1 Decentralized Signal Parameter Estimation
Many workhorse estimation schemes such as maximum likelihood estimation (MLE), least-squares estimation (LSE), best linear unbiased estimation (BLUE), as well as linear minimum mean-square error (LMMSE) estimation and maximum a posteriori (MAP) estimation, can all be formulated as a minimization task similar to (1); see, e.g., Estimation_Theory. However, the corresponding centralized estimation algorithms fall short in settings where both the acquired measurements and computational capabilities are distributed among multiple spatially scattered sensing nodes, which is the case with WSNs. Here we outline a batch decentralized optimization framework building on the ideas in Section 2, which formulates the desired estimator as the solution of a separable constrained convex minimization problem tackled via ADMM; see, e.g., bertsi97book; Boyd_ADMM; srg08tsp; sgrr08tsp for further details on the algorithms outlined here. Depending on the estimation technique utilized, the local cost functions $f_j$ in (1) should be chosen accordingly; see, e.g., Estimation_Theory; srg08tsp; sgrr08tsp. For instance, when $\mathbf{s}$ is assumed to be an unknown deterministic vector, then:

If $\hat{\mathbf{s}}$ corresponds to the centralized MLE, then $f_j(\mathbf{s}; \mathbf{y}_j)$ is the negative log-likelihood capturing the probability density function (pdf) of the data $\mathbf{y}_j$, while the network-wide data $\{\mathbf{y}_j\}_{j=1}^{J}$ are assumed statistically independent.
If $\hat{\mathbf{s}}$ corresponds to the BLUE (or weighted least-squares estimator), then $f_j(\mathbf{s}; \mathbf{y}_j) = \frac{1}{2}\|\mathbf{R}_j^{-1/2}(\mathbf{y}_j - \mathbf{H}_j \mathbf{s})\|^2$, where $\mathbf{R}_j$ denotes the covariance of the data $\mathbf{y}_j$, and $\mathbf{H}_j$ is a known fitting matrix.
When $\mathbf{s}$ is treated as a random vector, then:
If $\hat{\mathbf{s}}$ corresponds to the centralized MAP estimator, then $f_j(\mathbf{s}; \mathbf{y}_j) = -\ln p(\mathbf{y}_j \mid \mathbf{s}) - (1/J)\ln p(\mathbf{s})$ accounts for the data pdf and the prior pdf of $\mathbf{s}$, while the data are assumed conditionally independent given $\mathbf{s}$.
If $\hat{\mathbf{s}}$ corresponds to the centralized LMMSE, then $f_j$ is a quadratic cost parameterized by the cross-covariance of $\mathbf{s}$ with $\mathbf{y}_j$, where $\mathbf{y}_j$ stands for the $j$th block subvector of $\mathbf{y}$.
Substituting in (6) the specific local cost $f_j$ for each of the aforementioned estimation tasks yields a family of batch ADMM-based decentralized estimation algorithms. The decentralized BLUE algorithm will be described in this section as an example of decentralized linear estimation.
Recent advances in cyber-physical systems have also stressed the need for decentralized nonlinear least-squares (LS) estimation. Monitoring the power grid, for instance, is challenged by the nonconvexity arising from the nonlinear AC power flow model; see, e.g., (Wollenbergbook, Ch. 4), while the interconnection across local transmission systems motivates their operators to collaboratively monitor the global system state. Interestingly, this nonlinear (specifically quadratic) estimation task can be convexified to a semidefinite program (SDP) (Boyd_Convex, p. 168), for which a decentralized SDP algorithm can be developed by leveraging the batch ADMM; see also Wen2010 for an ADMM-based centralized SDP precursor.
Decentralized BLUE
The minimization involved in (6) can be performed locally at sensor $j$ by employing numerical optimization techniques Boyd_Convex. There are cases where the minimization in (6) yields a closed-form and easy-to-implement updating formula for $\mathbf{x}_j(k+1)$. If, for example, network nodes wish to find the BLUE estimator in a distributed fashion, the local cost is $f_j(\mathbf{x}; \mathbf{y}_j) = \frac{1}{2}(\mathbf{y}_j - \mathbf{H}_j\mathbf{x})^\top \mathbf{R}_j^{-1} (\mathbf{y}_j - \mathbf{H}_j\mathbf{x})$, and (6) becomes a strictly convex unconstrained quadratic program which admits the following closed-form solution (see details in srg08tsp; MSG_D_RLS)
\[ \mathbf{x}_j(k+1) = \Big( \mathbf{H}_j^\top \mathbf{R}_j^{-1} \mathbf{H}_j + 2c\,|\mathcal{N}_j|\, \mathbf{I}_p \Big)^{-1} \Big( \mathbf{H}_j^\top \mathbf{R}_j^{-1} \mathbf{y}_j - \mathbf{v}_j(k) + c \sum_{j' \in \mathcal{N}_j} \big[\mathbf{x}_j(k) + \mathbf{x}_{j'}(k)\big] \Big) \tag{7} \]
The pair (5) and (7) comprises the decentralized (D-) BLUE algorithm sg06asilomar; srg08tsp. For the special case where each node acquires unit-variance scalar observations $y_j$, there is no fitting matrix and $\mathbf{s}$ is scalar (i.e., $p = 1$); D-BLUE then offers a decentralized algorithm to obtain the network-wide sample average $\bar{y} = (1/J)\sum_{j=1}^{J} y_j$. The update rule for the local estimate is obtained by suitably specializing (7) to
\[ x_j(k+1) = \frac{y_j - v_j(k) + c \sum_{j' \in \mathcal{N}_j} \big[x_j(k) + x_{j'}(k)\big]}{1 + 2c\,|\mathcal{N}_j|}. \tag{8} \]
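As a sanity check of this averaging special case, the scalar D-BLUE recursions can be simulated directly; the path network, observations, and penalty value below are illustrative assumptions.

```python
import numpy as np

def dblue_average(y, neighbors, c=0.5, iters=5000):
    """D-BLUE for unit-variance scalar observations: every node's
    local estimate converges to the network-wide sample average."""
    J = len(y)
    x = np.zeros(J)   # local estimates x_j
    v = np.zeros(J)   # scaled multiplier sums v_j
    for _ in range(iters):
        for j in range(J):  # multiplier update, cf. (5)
            v[j] += c * sum(x[j] - x[jp] for jp in neighbors[j])
        x_new = np.empty(J)
        for j in range(J):  # scalar closed-form estimate update
            x_new[j] = (y[j] - v[j]
                        + c * sum(x[j] + x[jp] for jp in neighbors[j])) \
                       / (1 + 2 * c * len(neighbors[j]))
        x = x_new
    return x

y = np.array([1.0, 4.0, 2.0, 7.0, 6.0])                        # one scalar per node
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}  # path graph
x = dblue_average(y, neighbors)
```

Even on this poorly connected path graph, all local estimates reach the sample average of the observations without any node ever seeing the full data vector.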
Different from existing distributed averaging approaches Barbarossa_Scu_Coupled_Osci; dimakis10; Consensus_Averaging; xbk07jpdc, the ADMM-based one originally proposed in sg06asilomar; srg08tsp allows the decentralized computation of general nonlinear estimators that may not be available in closed form and cannot be expressed as “averages.” Further, the obtained recursions exhibit robustness in the presence of additive noise in the inter-node communication links.
Decentralized SDP
Consider now that each scalar entry of $\mathbf{y}_j$ adheres to a quadratic measurement model in $\mathbf{s}$ plus additive Gaussian noise, in which case the centralized MLE requires solving a nonlinear least-squares problem. To tackle the nonconvexity due to the quadratic dependence, the task of estimating $\mathbf{s}$ can be reformulated as that of estimating the outer-product matrix $\mathbf{S} := \mathbf{s}\mathbf{s}^\top$. In this reformulation, each (noiseless) measurement is a linear function of $\mathbf{S}$, expressible as its inner product with a known matrix hzgg_jstsp14. Motivated by the separable structure in (2), the nonlinear estimation problem can be similarly formulated as
\[ \{\hat{\mathbf{S}}_j\} = \arg\min_{\{\mathbf{S}_j\}} \sum_{j=1}^{J} f_j(\mathbf{S}_j; \mathbf{y}_j) \quad \text{s. to } \mathbf{S}_j = \mathbf{S}_{j'}, \; j' \in \mathcal{N}_j; \quad \mathbf{S}_j \succeq \mathbf{0}, \; \mathrm{rank}(\mathbf{S}_j) = 1, \; j = 1, \ldots, J \tag{9} \]
where the positive-semidefiniteness and rank constraints ensure that each matrix $\mathbf{S}_j$ is an outer-product matrix. By dropping the nonconvex rank constraints, problem (9) becomes a convex SDP, which can be solved in a decentralized fashion by adopting the batch ADMM iterations (5) and (6).
This decentralized SDP approach has been successfully employed for monitoring large-scale power networks gg_spmag13. To estimate the complex voltage phasors at all buses (a.k.a. the power system state), measurements are collected on real/reactive power and voltage magnitude, all of which have quadratic dependence on the unknown states. Gauss-Newton iterations have been the ‘workhorse’ tool for this nonlinear estimation problem; see, e.g., SE_book; Wollenbergbook. However, the iterative linearization therein could suffer from convergence issues and local optimality, especially due to the increasing variability in power grids with high penetration of renewables. With improved communication capabilities, decentralized state estimation among multiple control centers has attracted growing interest; see Fig. 1, which illustrates three interconnected areas aiming to achieve the centralized estimation collaboratively.
A decentralized SDP-based state estimator has been developed in hzgg_jstsp14 with reduced complexity compared to (9). The resultant algorithm involves only the internal voltages and those of next-hop neighbors in each local matrix $\mathbf{S}_j$; e.g., the local area identified by the dashed lines in Fig. 1. Interestingly, the positive-semidefiniteness constraint for the overall matrix decouples nicely into that of all local matrices, and the estimation error converges to the centralized performance within only a dozen iterations. The decentralized SDP framework has successfully addressed a variety of power system operational challenges, including a distributed microgrid optimal power flow solver in edhzgg_tsg13; see also gg_spmag13 for a tutorial overview of these applications.
3.2 Decentralized Inference
Along with decentralized signal parameter estimation, a variety of inference tasks become possible by relying on the collaborative sensing and computations performed by networked nodes. In the special context of resource-constrained WSNs deployed to determine the common messages broadcast by a wireless AP, the relatively limited node reception capability makes it desirable to design a decentralized detection scheme through which all sensors attain sufficient statistics for the global problem. Another exciting application of WSNs is environmental monitoring, e.g., for inferring the presence or absence of a pollutant over a geographical area. Limited by the local sensing capability, it is important to develop a decentralized learning framework such that all sensors can collaboratively approach the performance attainable if the network-wide data had been available everywhere (or at an FC for that matter). Given the diverse inference tasks, the challenge becomes how to design the best inter-node information exchange schemes that would allow for minimal communication and computation overhead in specific applications.
Decentralized Detection
Message decoding. A decentralized detection framework is introduced here for the message decoding task, which is relevant for diverse wireless communications and networking scenarios. Consider an AP broadcasting a coded block $\mathbf{t} \in \mathcal{C}$ to a network of sensors, all of which know the codebook $\mathcal{C}$ that $\mathbf{t}$ belongs to. For simplicity, assume binary codewords of length $M$, and that each node $j$ receives a same-length block $\mathbf{y}_j$ of symbols through a discrete, memoryless, symmetric channel that is conditionally independent across sensors. Sensor $j$ knows its local channel from the AP, as characterized by the per-bit conditional pdf $p_j(y_{jm} \mid t_m)$. Due to conceivably low signal-to-noise ratio (SNR) conditions, each low-cost sensor may be unable to reliably decode the message on its own. Accordingly, the need arises for information exchanges among single-hop neighboring sensors to achieve the global (that is, centralized) error performance. Given $\mathbf{y}_j$ per sensor $j$, the assumption of memoryless and independent channels yields the centralized maximum-likelihood (ML) decoder as
\[ \hat{\mathbf{t}} = \arg\max_{\mathbf{t} \in \mathcal{C}} \sum_{j=1}^{J} \sum_{m=1}^{M} \ln p_j(y_{jm} \mid t_m) \tag{10} \]
ML decoding amounts to deciding the most likely codeword among multiple candidate ones and, in this sense, can be viewed as a test of multiple hypotheses. In this general context, belief propagation approaches have been developed in sas06tsp, so that all nodes can cooperate to learn the centralized likelihood per hypothesis. However, even for linear binary block codes, the number of hypotheses, namely the cardinality of $\mathcal{C}$, grows exponentially with the codeword length. This introduces a high communication and computation burden for low-cost sensor designs.
The key here is to extract minimal sufficient statistics for the centralized decoding problem. For binary codes, each log-likelihood term in (10) can be written as $\ln p_j(y_{jm} \mid t_m) = t_m \gamma_{jm} + \ln p_j(y_{jm} \mid 0)$, where
\[ \gamma_{jm} := \ln \frac{p_j(y_{jm} \mid 1)}{p_j(y_{jm} \mid 0)} \tag{11} \]
is the local log-likelihood ratio (LLR) for the $m$th bit at sensor $j$. Ignoring all constant terms $\ln p_j(y_{jm} \mid 0)$, the ML decoding objective ends up depending only on the per-bit LLR sums $\sum_{j=1}^{J} \gamma_{jm}$. Clearly, the sufficient statistic for solving (10) is the sum of all local LLR terms or, equivalently, the average $(1/J)\sum_{j=1}^{J} \gamma_{jm}$ for each bit $m$. Interestingly, the average of the local LLR vectors $\boldsymbol{\gamma}_j := [\gamma_{j1}, \ldots, \gamma_{jM}]^\top$ is one instance of the BLUE discussed in Section 3.1 with identity fitting and covariance matrices, since
\[ \bar{\boldsymbol{\gamma}} = \frac{1}{J} \sum_{j=1}^{J} \boldsymbol{\gamma}_j = \arg\min_{\mathbf{x}} \sum_{j=1}^{J} \|\boldsymbol{\gamma}_j - \mathbf{x}\|^2 \tag{12} \]
This way, the ADMM-based decentralized learning framework in Section 2 allows all sensors to collaboratively attain the sufficient statistic for the decoding problem (10) via in-network processing. Each sensor only needs to estimate a vector of the codeword length $M$, which bypasses the exponential complexity incurred under the belief propagation framework. As shown in hzggac08tsp, decentralized soft decoding is also feasible, since the a posteriori probability (APP) evaluator also relies on LLR averages, which are sufficient statistics; extensions to non-binary alphabet codeword constraints and randomly failing inter-sensor links are considered there as well.
The bit error rate (BER) versus SNR plot in Fig. 2 demonstrates the performance of ADMM-based in-network decoding of a convolutional code over AWGN AP-sensor channels. Four schemes are compared: (i) the local ML decoder based on per-sensor data only, which is used to initialize the decentralized iterations; (ii) the centralized benchmark ML decoder; (iii) the in-network decoder which forms the LLR averages using “consensus-averaging” linear iterations Consensus_Averaging; and (iv) the ADMM-based decentralized algorithm. Indeed, the ADMM-based decoder exhibits faster convergence than its consensus-averaging counterpart; remarkably, only 10 iterations suffice to bring the decentralized BER very close to the centralized performance.
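The sufficient-statistic argument behind (10)–(12) can be checked numerically. The sketch below assumes BPSK transmission of binary codewords over per-sensor AWGN channels, a small randomly drawn codebook, and a noise level chosen for clarity; none of these specifics come from the chapter's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
M, J, sigma = 10, 5, 0.2
code = rng.integers(0, 2, size=(8, M))   # illustrative codebook C
t_true = code[3]

# Each sensor receives the BPSK symbols 2t-1 in AWGN.
Y = (2 * t_true - 1) + sigma * rng.standard_normal((J, M))

# Centralized ML decoder (10): uses all raw data {y_j}.
def loglik(t, Y):
    s = 2 * t - 1
    return -np.sum((Y - s) ** 2) / (2 * sigma ** 2)

t_central = code[np.argmax([loglik(t, Y) for t in code])]

# Decoder using only the per-bit LLR sums (the sufficient statistic):
# for this AWGN channel, gamma_{jm} = 2*y_{jm}/sigma^2, cf. (11).
Gamma = np.sum(2 * Y / sigma ** 2, axis=0)
t_suff = code[np.argmax([t @ Gamma for t in code])]
```

Because the per-codeword log-likelihood equals a constant plus the inner product of the codeword with the summed LLRs, the two decoders pick the same codeword, yet the second needs only one length-$M$ vector per sensor to be averaged in-network.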
Message demodulation. In a related detection scenario, the common AP message can be mapped to a space-time matrix, with each entry drawn from a finite alphabet $\mathcal{A}$. The received block per sensor $j$ typically admits a linear input/output relationship $\mathbf{y}_j = \mathbf{H}_j \mathbf{s} + \mathbf{n}_j$. The matrix $\mathbf{H}_j$ is formed from the fading AP-sensor channel, and $\mathbf{n}_j$ stands for additive white Gaussian noise of unit variance, assumed uncorrelated across sensors. Since low-cost sensors have a very limited budget on the number of antennas compared to the AP, the length of $\mathbf{y}_j$ is much shorter than that of $\mathbf{s}$. Hence, the local linear demodulator using only $\mathbf{y}_j$ may not even be able to identify $\mathbf{s}$. Again, it is critical for each sensor to cooperate with its neighbors to collectively form the global ML demodulator
\[ \hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{A}^p} \Big\{ \mathbf{s}^\top \Big( \frac{1}{J}\sum_{j=1}^{J} \mathbf{H}_j^\top \mathbf{H}_j \Big) \mathbf{s} - 2\, \mathbf{s}^\top \Big( \frac{1}{J}\sum_{j=1}^{J} \mathbf{H}_j^\top \mathbf{y}_j \Big) \Big\} \tag{13} \]
where the sample (cross-)covariance terms are the network-wide averages of $\mathbf{H}_j^\top \mathbf{H}_j$ and $\mathbf{H}_j^\top \mathbf{y}_j$. To solve (13) locally, it suffices for each sensor to acquire these two averages, as they constitute the minimal sufficient statistics for the centralized demodulator. Arguments similar to those for decentralized decoding lead to ADMM iterations that (as with BLUE) attain these average terms locally. These iterations constitute a viable decentralized demodulation method, whose performance analysis in hzacgg10twc reveals that its error diversity order can approach the centralized one within only a dozen iterations.
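The claim that the two averages are sufficient can again be verified directly. The sketch below assumes single-antenna sensors, a two-symbol block with entries from $\{-1, +1\}$, and a brute-force search over the (tiny) alphabet; all of these are illustrative assumptions, and the averages are computed directly here rather than via the ADMM iterations.

```python
import numpy as np

rng = np.random.default_rng(2)
J, p = 6, 2
s_true = np.array([1.0, -1.0])                         # symbols from A = {-1, +1}
H = [rng.standard_normal((1, p)) for _ in range(J)]    # one antenna per sensor
y = [Hj @ s_true + 0.1 * rng.standard_normal(1) for Hj in H]

# Network-wide averages of H_j^T H_j and H_j^T y_j: the sufficient
# statistics in (13), obtainable in-network via ADMM averaging.
G = sum(Hj.T @ Hj for Hj in H) / J
b = sum(Hj.T @ yj for Hj, yj in zip(H, y)) / J

candidates = [np.array([a0, a1]) for a0 in (-1.0, 1.0) for a1 in (-1.0, 1.0)]

# Demodulator (13) using only the averaged statistics ...
s_suff = min(candidates, key=lambda s: s @ G @ s - 2 * b @ s)
# ... and the centralized ML demodulator using all raw data.
s_central = min(candidates, key=lambda s: sum(np.sum((yj - Hj @ s) ** 2)
                                              for Hj, yj in zip(H, y)))
```

Note that no single sensor can identify $\mathbf{s}$ from its scalar observation alone, yet the averaged statistics reproduce the centralized decision exactly.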
As demonstrated by the decoding and demodulation tasks, the cornerstone of developing a decentralized detection scheme is to extract the minimal sufficient statistics for the centralized hypothesis testing problem. This leads to significant complexity reduction in terms of communication and computational overhead.
Decentralized Support Vector Machines
The merits of support vector machines (SVMs) in a centralized setting have been well documented in various supervised classification tasks including surveillance, monitoring, and segmentation; see, e.g., smola. These applications often call for decentralized supervised learning solutions, when limited training data are acquired at different locations and a central processing unit is costly or even discouraged due to, e.g., scalability, communication overhead, or privacy concerns. Noteworthy examples include WSNs for environmental or structural health monitoring, as well as diagnosis of medical conditions from patients' records distributed at different hospitals. In this in-network classification task, a labeled training set $\{(\mathbf{x}_{jn}, y_{jn})\}_{n=1}^{N_j}$ of size $N_j$ is available per node $j$, where $\mathbf{x}_{jn}$ is an input data vector and $y_{jn} \in \{-1, 1\}$ denotes its corresponding class label. Given all network-wide training data, the centralized SVM seeks a maximum-margin linear discriminant function $g(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$, by solving the following convex optimization problem smola
\[ \{\hat{\mathbf{w}}, \hat{b}\} = \arg\min_{\mathbf{w}, b, \{\xi_{jn}\}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{j=1}^{J} \sum_{n=1}^{N_j} \xi_{jn} \quad \text{s. to } y_{jn}(\mathbf{w}^\top \mathbf{x}_{jn} + b) \geq 1 - \xi_{jn}, \; \xi_{jn} \geq 0 \tag{14} \]
where the slack variables $\xi_{jn}$ account for non-linearly separable training sets, and $C$ is a tunable positive scalar that allows control of the model complexity. Nonlinear discriminant functions can also be accommodated after mapping the input vectors to a higher- (possibly infinite-) dimensional space using, e.g., kernel functions, and pursuing a generalized maximum-margin linear classifier as in (14). Since the SVM classifier (14) couples the local datasets, early distributed designs either rely on a centralized processor, so they are not decentralized van08dpsvm, or their performance is not guaranteed to reach that of the centralized SVM navia06dsvm.
A fresh view of decentralized SVM classification is taken in pfacgg10jmlr, which reformulates (14) to estimate the parameter pair $\{\mathbf{w}, b\}$ from all local data after eliminating the slack variables $\xi_{jn}$, namely
\[ \{\hat{\mathbf{w}}, \hat{b}\} = \arg\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{j=1}^{J} \sum_{n=1}^{N_j} \max\big\{0,\, 1 - y_{jn}(\mathbf{w}^\top \mathbf{x}_{jn} + b)\big\} \tag{15} \]
Notice that (15) has the same decomposable structure as the general decentralized learning task in (1), upon identifying the local cost with node $j$'s share of the regularizer plus its local hinge-loss terms. Accordingly, all network nodes can solve (15) in a decentralized fashion via iterations obtained following the ADMM-based algorithmic framework of Section 2. The resulting decentralized ADMM-DSVM scheme is provably convergent to the centralized SVM classifier (14), and can also incorporate nonlinear discriminant functions as detailed in pfacgg10jmlr.
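The slack-free form (15) is convenient because it is amenable to simple (sub)gradient-based solvers. The centralized sketch below minimizes a regularized hinge loss on synthetic two-class Gaussian data; the Pegasos-style stepsize, data distribution, and parameter values are illustrative assumptions, and this is the centralized baseline, not the ADMM-DSVM iterations of pfacgg10jmlr.

```python
import numpy as np

def hinge_svm(X, ylab, lam=0.01, iters=2000):
    """Subgradient descent on the slack-free SVM objective
    0.5*lam*||w||^2 + mean(max(0, 1 - y*(w^T x + b))), cf. (15)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, iters + 1):
        margins = ylab * (X @ w + b)
        viol = margins < 1                    # margin violators
        gw = lam * w - (ylab[viol, None] * X[viol]).sum(axis=0) / n
        gb = -ylab[viol].sum() / n
        eta = 1.0 / (lam * t)                 # diminishing stepsize
        w -= eta * gw
        b -= eta * gb
    return w, b

# Two equiprobable Gaussian classes with a linear Bayes-optimal boundary.
rng = np.random.default_rng(3)
n = 100
X = np.vstack([rng.normal(-2, 1, size=(n, 2)), rng.normal(2, 1, size=(n, 2))])
ylab = np.hstack([-np.ones(n), np.ones(n)])
w, b = hinge_svm(X, ylab)
acc = np.mean(np.sign(X @ w + b) == ylab)
```

In the decentralized setting, each node would apply such (sub)gradient steps to its local hinge-loss terms inside the update (6), with consensus enforced through the multipliers.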
To illustrate the performance of the ADMM-DSVM algorithm in pfacgg10jmlr, consider a randomly generated network. Each node acquires labeled training examples from two equiprobable classes, consisting of random vectors drawn from two-dimensional Gaussian distributions with a common covariance matrix and different mean vectors. The Bayes-optimal classifier for this two-class problem is linear (duda, Ch. 2). To visualize this test case, Fig. 3 depicts the global training set, along with the linear discriminant functions found by the centralized SVM (14) and the ADMM-DSVM at two different nodes after 400 iterations. Local SVM results for two different nodes are also included for comparison. It is apparent that ADMM-DSVM approaches the decision rule of its centralized counterpart, whereas the local classifiers deviate since they neglect most of the training examples in the network.
Decentralized Clustering
Unsupervised learning using a network of wireless sensors as an exploratory infrastructure is well motivated for inferring hidden structures in the distributed data collected by the sensors. Different from supervised SVM-based classification tasks, each node now has available a set of unlabeled observations, drawn from a total of $K$ classes. In this network setting, the goal is to design local clustering rules assigning each observation to a cluster $k \in \{1, \ldots, K\}$. Again, the desideratum is a decentralized algorithm capable of attaining the performance of a benchmark clustering scheme in which all observations are centrally available for joint processing.
Various criteria are available to quantify similarity among observations in a centralized setting; a popular choice is the deterministic partitional clustering (DPC) one, entailing prototypical elements (a.k.a. cluster centroids) per class in order to avoid comparisons between every pair of observations. Let $\mathbf{m}_k$ denote the prototype element for class $k$, and $\nu_{jn}^{k}$ the membership coefficient of observation $\mathbf{x}_{jn}$ to class $k$. A natural clustering problem amounts to specifying the family of $K$ clusters with centroids $\{\mathbf{m}_k\}$, such that the sum of squared errors is minimized; that is
\[ \min_{\{\nu_{jn}^k\} \in \mathcal{V},\, \{\mathbf{m}_k\}} \sum_{j=1}^{J} \sum_{n=1}^{N_j} \sum_{k=1}^{K} (\nu_{jn}^k)^{\rho}\, \|\mathbf{x}_{jn} - \mathbf{m}_k\|^2 \tag{16} \]
where $\rho \geq 1$ is a tuning parameter, and $\mathcal{V}$ denotes the convex set of constraints on all membership coefficients. With $\rho = 1$ and the centroids fixed, (16) becomes a linear program in the membership coefficients. Consequently, (16) admits binary optimal solutions giving rise to the so-termed hard assignments, obtained by assigning each observation to the cluster whose centroid is closest. Otherwise, for $\rho > 1$ the optimal coefficients generally result in soft membership assignments, and the optimal cluster is the one with the largest membership coefficient. In either case, the DPC clustering problem (16) is NP-hard, which motivates the (suboptimal) K-means algorithm that, on a per-iteration basis, proceeds in two steps to minimize the cost in (16) w.r.t.: (S1) the membership coefficients, with the centroids fixed; and (S2) the centroids, with the membership coefficients fixed lloyd82PCM. Convergence of this two-step alternating-minimization scheme is guaranteed at least to a local minimum. Nonetheless, K-means requires central availability of global information (those variables that are fixed per step), which challenges in-network implementations. For this reason, most early attempts are either confined to specific communication network topologies, or they offer no closed-form local solutions; see, e.g., Nowak03dem; whk08ICML. To address these limitations, pfacgg11jstsp casts (16) [yet another instance of (1)] as a decentralized estimation problem. It is thus possible to leverage ADMM iterations and solve (16) in a decentralized fashion through information exchanges among single-hop neighbors only. Despite the nonconvexity of (16), the decentralized DPC iterations in pfacgg11jstsp provably approach a local minimum arbitrarily closely, where the asymptotic convergence holds for hard K-means with $\rho = 1$. Further extensions in pfacgg11jstsp
include a decentralized expectation-maximization algorithm for probabilistic partitional clustering, and methods to handle an unknown number of classes.
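For reference, the two-step minimization (S1)–(S2) underlying hard K-means can be sketched as follows; the deterministic farthest-point initialization and the toy data are illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, iters=50):
    """Lloyd's K-means: alternate (S1) hard assignments with centroids
    fixed, and (S2) centroid updates with assignments fixed."""
    # Deterministic init: first point, then points farthest from those chosen.
    m = [X[0]]
    for _ in range(K - 1):
        d = np.min([np.sum((X - mk) ** 2, axis=1) for mk in m], axis=0)
        m.append(X[np.argmax(d)])
    m = np.array(m)
    costs = []
    for _ in range(iters):
        # (S1): assign each point to the nearest centroid (hard membership)
        D = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        labels = D.argmin(axis=1)
        costs.append(D[np.arange(len(X)), labels].sum())
        # (S2): recompute each centroid as the mean of its cluster
        for k in range(K):
            if np.any(labels == k):
                m[k] = X[labels == k].mean(axis=0)
    return labels, m, costs

# Two well-separated Gaussian blobs.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)), rng.normal(8, 0.5, size=(30, 2))])
labels, m, costs = kmeans(X, K=2)
```

Each step can only decrease the cost in (16), which is the monotonicity property behind the local-minimum convergence guarantee quoted above; the decentralized DPC iterations replace the globally shared quantities in (S1)–(S2) with in-network ADMM exchanges.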
Clustering of oceanographic data. Environmental monitoring is a typical application of WSNs. In WSNs deployed for oceanographic monitoring, the cost of computation per node is lower than the cost of accessing each node's observations oceansensors. This makes the option of centralized processing less attractive, thus motivating decentralized processing. Here we test the decentralized DPC schemes of pfacgg11jstsp on real data collected by multiple underwater sensors in the Mediterranean coast of Spain WOD, with the goal of identifying regions sharing common physical characteristics. The selected feature vectors have as entries the temperature (°C) and salinity (psu) levels. The measurements were normalized to have zero mean and unit variance, and were grouped in blocks, one per sensor. The algebraic connectivity of the WSN is 0.2289 and the average degree per node is 4.9. Fig. 4 (left) shows the performance of 25 Monte Carlo runs for the hard-DKM algorithm with different values of the tuning parameter. The best choice attained the average centralized performance after 300 iterations; tests with other parameter values are also included in Fig. 4 (left) for comparison. Note that for small parameter values hard-DKM hovers around a point without converging, whereas choosing a larger value guarantees convergence of the algorithm to a unique solution. The clustering results of hard-DKM for two representative settings are depicted in Fig. 4 (right).
4 Decentralized Adaptive Estimation
Sections 2 and 3 dealt with decentralized batch estimation, whereby network nodes acquire data only once and then locally exchange messages to reach consensus on the desired estimators. In many applications however, networks are deployed to perform estimation in a constantly changing environment without having available a complete statistical description of the underlying processes of interest, e.g., with time-varying thermal or seismic sources. This motivates the development of decentralized adaptive estimation schemes, where nodes collect data sequentially in time and local estimates are recursively refined “on-the-fly.” In settings where statistical state models are available, it is prudent to develop model-based tracking approaches implementing in-network Kalman or particle filters. Next, the scope of Section 2 is broadened to facilitate real-time (adaptive) processing of network data, when the local costs in (1) and the unknown parameters are allowed to vary with time.
4.1 Decentralized Least-Mean Squares
A decentralized least-mean squares (LMS) algorithm is developed here for adaptive estimation of (possibly) nonstationary parameters, even when statistical information such as ensemble data covariances is unavailable. Suppose network nodes are deployed to estimate a signal vector in a collaborative fashion subject to single-hop communication constraints, by resorting to the linear LMS criterion; see e.g., sk95book ; Diffusion_LMS ; SMG_D_LMS . Per time instant, each node has available a regression vector and acquires a scalar observation, both assumed zero-mean without loss of generality. Introducing the corresponding network-wide observation vector and regression matrix, the global time-dependent LMS estimator of interest can be written as sk95book ; Diffusion_LMS ; SMG_D_LMS
(17) 
For jointly wide-sense stationary observations and regressors, solving (17) leads to the well-known Wiener filter estimate, expressed in terms of the ensemble auto- and cross-covariances of the data; see e.g., (sk95book, p. 15).
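To make the estimator concrete, the following minimal sketch forms the Wiener solution from sample (rather than ensemble) moments; the symbol names (H, y, s0) and all dimensions are illustrative assumptions, since the chapter's own notation is not reproduced here.

```python
import numpy as np

# Empirical Wiener solution from sample moments: solve R_h s = r_hy,
# the estimate that (D-)LMS approximates via stochastic gradients.
rng = np.random.default_rng(0)
p, T = 4, 5000
s0 = rng.normal(size=p)                  # true parameter vector
H = rng.normal(size=(T, p))              # regression vectors, one per row
y = H @ s0 + 0.1 * rng.normal(size=T)    # noisy scalar observations

R_h = H.T @ H / T                        # sample autocovariance of regressors
r_hy = H.T @ y / T                       # sample cross-covariance
s_hat = np.linalg.solve(R_h, r_hy)       # empirical Wiener estimate
```

With abundant data the sample moments converge to their ensemble counterparts, and s_hat approaches s0; the adaptive algorithms below avoid forming these moments altogether.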
For the case where the auto- and cross-covariance matrices are unknown, the approach followed here to develop the decentralized (D-) LMS algorithm includes two main building blocks: (i) recast (17) into an equivalent form amenable to in-network processing via the ADMM framework of Section 2; and (ii) leverage stochastic approximation iterations kushner to obtain an adaptive LMS-like algorithm that can handle the unavailability/variation of statistical information. Following the algorithmic construction steps outlined in Section 2, the following updating recursions are obtained for the multipliers and the local estimates per time instant
(18)  
(19) 
It is apparent that, after differentiating the cost in (19) and setting the gradient equal to zero, the local estimate can be obtained as the root of an equation of the form
(20) 
where the quantity inside the expectation corresponds to the stochastic gradient of the cost in (19). However, the previous equation cannot be solved, since the nodes do not have available any statistical information about the acquired data. Inspired by stochastic approximation techniques (such as the celebrated Robbins-Monro algorithm; see e.g., (kushner, Ch. 1)), which iteratively find the root of (20) given noisy observations, one can simply drop the unknown expected value to obtain the following D-LMS (i.e., stochastic gradient) updates
(21) 
where a constant step-size is employed, and the innovation term equals twice the local a priori error.
Recursions (18) and (21) constitute the D-LMS algorithm, which can be viewed as a stochastic-gradient counterpart of D-BLUE in Section 3.1. D-LMS is a pioneering approach for decentralized online learning, which blends for the first time affordable (first-order) stochastic approximation steps with parallel ADMM iterations. The use of a constant step-size endows D-LMS with tracking capabilities. This is desirable in a constantly changing environment, within which e.g., WSNs are envisioned to operate. The D-LMS algorithm is stable and converges even in the presence of inter-node communication noise (see details in SMG_D_LMS ; MSG_D_LMS ). Further, closed-form expressions for the evolution and the steady-state mean-square error (MSE), as well as selection guidelines for the step-size, can be found in MSG_D_LMS .
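The flavor of such consensus-based stochastic-gradient updates can be conveyed by the following simplified sketch, which replaces the ADMM multiplier recursions of (18) with a plain consensus penalty on a line graph; the step-size mu, penalty c, topology, and all dimensions are illustrative, so this is a caricature of D-LMS rather than the algorithm of SMG_D_LMS itself.

```python
import numpy as np

# Simplified consensus-penalized decentralized LMS on a 5-node line graph.
# Each node takes a local stochastic-gradient step plus a pull toward
# its neighbors' previous estimates.
rng = np.random.default_rng(0)
J, p, T = 5, 3, 4000
s0 = rng.normal(size=p)                          # common true parameter
neighbors = {j: [i for i in (j - 1, j + 1) if 0 <= i < J] for j in range(J)}
S = np.zeros((J, p))                             # local estimates, one per node
mu, c = 0.01, 0.5                                # illustrative step-size/penalty

for t in range(T):
    S_prev = S.copy()
    for j in range(J):
        h = rng.normal(size=p)                   # local regression vector
        y = h @ s0 + 0.1 * rng.normal()          # local noisy observation
        e = y - h @ S_prev[j]                    # local a priori error
        cons = sum(S_prev[j] - S_prev[i] for i in neighbors[j])
        S[j] = S_prev[j] + mu * (h * e - c * cons)   # LMS + consensus step
```

All local estimates converge to a neighborhood of s0 whose size is governed by the step-size, mirroring the constant-step-size tracking/misadjustment tradeoff discussed above.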
Here we test the tracking performance of D-LMS via computer simulation. For a random geometric graph, network-wide observations are linearly related to a large-amplitude, slowly time-varying parameter vector that evolves according to a stochastic state model with normally distributed driving noise. To model noisy links, additive white Gaussian noise is present at the receiving end. Fig. 5 (left) depicts the local performance of two representative nodes through the evolution of the excess mean-square error and mean-square deviation figures of merit. Both noisy and ideal links are considered, and the empirical curves closely follow the theoretical trajectories derived in MSG_D_LMS . Steady-state limiting values are also predicted extremely accurately. As intuitively expected and suggested by the analysis, a performance penalty due to non-ideal links is apparent. Fig. 5 (right) illustrates how the adaptation level affects the resulting per-node estimates when tracking time-varying parameters with D-LMS. For slow and near-optimal adaptation step-sizes, we depict the third entry of the parameter vector and the respective estimates at the (randomly chosen) sixth node. Under optimal adaptation the local estimate closely tracks the true variations, while – as expected – for the smaller step-size D-LMS fails to provide an accurate estimate MSG_D_LMS ; sk95book .
4.2 Decentralized Recursive Least-Squares
The recursive least-squares (RLS) algorithm has well-appreciated merits for reducing complexity and storage requirements in online estimation of stationary signals, as well as for tracking slowly-varying nonstationary processes sk95book ; Estimation_Theory . RLS is especially attractive when the state and/or data model are not available (as with LMS), and fast convergence rates are at a premium. Compared to LMS, RLS typically offers faster convergence and improved estimation performance at the cost of higher computational complexity. To enable these valuable tradeoffs in the context of in-network processing, the ADMM framework of Section 2 is utilized here to derive a decentralized (D-) RLS adaptive scheme that can be employed for distributed localization and power spectrum estimation (see also MSG_D_RLS ; MG_D_RLS for further details on the algorithmic construction and convergence claims).
Consider the data setting and linear regression task in Section 4.1. The RLS estimator for the unknown parameter minimizes the exponentially weighted least-squares (EWLS) cost; see e.g., sk95book ; Estimation_Theory
(22) 
where a forgetting factor exponentially discounts past data, while a positive definite matrix is included for regularization. Note that in forming the EWLS estimator at each time instant, the entire history of acquired data is incorporated in the online estimation process. Whenever the forgetting factor is strictly smaller than one, past data are exponentially discarded, thus enabling tracking of nonstationary processes.
Again, to decompose the cost function in (22), whose summands are coupled through the global variable, we introduce auxiliary variables that represent local estimates per node. These local estimates are utilized to form the convex, constrained, and separable minimization problem in (2), which can be solved using ADMM to yield the following decentralized iterations (details in MSG_D_RLS ; MG_D_RLS )
(23)  
(24) 
where and
(25)  
(26) 
The D-RLS recursions (23) and (24) involve inter-node communication exchanges similar to those of D-LMS. It is recommended to initialize the matrix recursion with a sufficiently large regularization sk95book . The local estimates in D-RLS converge in the mean sense to the true parameter (time-invariant case), even when information exchanges are imperfect. Closed-form expressions for the bounded estimation MSE, along with numerical tests and comparisons with the incremental RLS inc_RLS and diffusion RLS Diffusion_RLS algorithms, can be found in MG_D_RLS .
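For reference, the single-node EWLS recursion that D-RLS builds upon can be sketched as follows; the forgetting factor lambda_, the initialization scale delta, and all data are illustrative choices, not the chapter's.

```python
import numpy as np

# Single-node exponentially weighted RLS: the local recursion that
# D-RLS decentralizes via ADMM message exchanges.
rng = np.random.default_rng(0)
p, T = 3, 500
s0 = rng.normal(size=p)            # true (time-invariant) parameter
lambda_, delta = 0.98, 100.0       # forgetting factor, init scale
P = delta * np.eye(p)              # inverse-covariance state
s = np.zeros(p)                    # running EWLS estimate

for t in range(T):
    h = rng.normal(size=p)                     # regression vector
    y = h @ s0 + 0.05 * rng.normal()           # noisy observation
    # standard RLS gain and rank-one inverse (matrix-inversion-lemma) update
    k = P @ h / (lambda_ + h @ P @ h)
    s = s + k * (y - h @ s)
    P = (P - np.outer(k, h @ P)) / lambda_
```

The per-update cost is quadratic in the parameter dimension (versus linear for LMS), which is the complexity/convergence tradeoff noted above.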
Decentralized spectrum sensing using WSNs. A WSN application where the need for linear regression arises is spectrum estimation for the purpose of environmental monitoring. Suppose sensors comprising a WSN deployed over some area of interest observe a narrowband source to determine its spectral peaks. These peaks can reveal hidden periodicities due to e.g., a natural heat or seismic source. The source of interest propagates through multipath channels and is contaminated with additive noise present at the sensors. The unknown source-sensor channels may introduce deep fades at the frequency band occupied by the source. Thus, having each sensor operate on its own may lead to faulty assessments. The spatial diversity available to effect improved spectral estimates can only be exploited via sensor collaboration, as in the decentralized estimation algorithms presented in this chapter.
Suppose the source signal evolves in time as an autoregressive (AR) process of given order, with unknown AR coefficients and driving white noise (Stoica_Book, p. 106). The source propagates to each sensor via a channel modeled as an FIR filter of unknown order and tap coefficients, and is contaminated with additive sensing noise to yield the local observation. Since each observation is consequently an autoregressive moving average (ARMA) process, it holds that Stoica_Book
(27) 
where the MA coefficients and the variance of the white noise process depend on the channel taps and the variances of the driving and sensing noise terms. For the purpose of determining spectral peaks, the MA term in (27) can be treated as observation noise. This is very important, since this way sensors do not have to know the source-sensor channel coefficients or the noise variances. Accordingly, the spectral content of the source can be estimated provided sensors estimate the AR coefficients, which are thus taken as the unknown parameter of interest. From (27), the regression vectors comprise past sensor measurements, and can be acquired directly without the need for training/estimation.
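The overall pipeline — simulate an AR source, fit its coefficients by least squares on past samples, and locate the spectral peak from the fitted coefficients — can be sketched as follows; the AR order, pole locations, and sample sizes are illustrative, not the chapter's experimental values.

```python
import numpy as np

# AR(2) source with a spectral peak near normalized frequency 0.2;
# fit the AR coefficients by least squares and recover the peak.
rng = np.random.default_rng(0)
r, w0 = 0.95, 2 * np.pi * 0.2                  # pole radius and angle
a = np.array([2 * r * np.cos(w0), -r ** 2])    # AR(2) coefficients
T = 20000
s = np.zeros(T)
for t in range(2, T):
    s[t] = a[0] * s[t - 1] + a[1] * s[t - 2] + rng.normal()

# regressors are simply past samples, so no training is needed (cf. text)
H = np.column_stack([s[1:-1], s[:-2]])
y = s[2:]
a_hat = np.linalg.lstsq(H, y, rcond=None)[0]

# AR power spectrum is proportional to 1 / |A(e^{jw})|^2; find its peak
w = np.linspace(0, np.pi, 2048)
A = 1 - a_hat[0] * np.exp(-1j * w) - a_hat[1] * np.exp(-2j * w)
peak = w[np.argmin(np.abs(A))] / (2 * np.pi)   # normalized peak frequency
```

A decentralized variant would replace the batch least-squares fit with the D-LMS or D-RLS recursions, each sensor forming its regressors from its own past measurements.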
The performance of the decentralized adaptive algorithms described so far is illustrated next, when applied to the aforementioned power spectrum estimation task. For the numerical experiments, an ad hoc WSN is simulated as a realization of a random geometric graph. The source-sensor channels corresponding to a few of the sensors are set so that they have a null at the frequency where the AR source has a peak. Fig. 6 (left) depicts the actual power spectral density (PSD) of the source, as well as the estimated PSDs for one of the sensors affected by a bad channel. To form the desired estimates in a distributed fashion, the WSN runs the local (L-) LMS and the D-LMS algorithms outlined in Section 4.1. L-LMS is a non-cooperative scheme, since each sensor independently runs an LMS adaptive filter fed by its local data only. The experiment involving D-LMS is performed under both ideal and noisy inter-sensor links. Clearly, even in the presence of communication noise, D-LMS exploits the available spatial diversity and allows all sensors to estimate the actual spectral peak accurately, whereas L-LMS leads the problematic sensors to misleading estimates.
For the same setup, Fig. 6 (right) shows the evolution of the global learning curve. The D-LMS and D-RLS algorithms are compared under ideal communication links. It is apparent that D-RLS achieves improved performance both in terms of convergence rate and steady-state MSE. As discussed in Section 4.2, this comes at the price of increased computational complexity per sensor, while the communication costs incurred are identical.
4.3 Decentralized Model-based Tracking
The decentralized adaptive schemes in Sections 4.1 and 4.2 are suitable for tracking slowly time-varying signals in settings where no statistical models are available. In certain cases, such as target tracking, state evolution models can be derived and employed by exploiting the physics of the problem. The availability of such models paves the way for improved state tracking via Kalman filtering/smoothing techniques; see e.g., Opt_Filtering_Moore ; Estimation_Theory . Model-based decentralized Kalman filtering/smoothing, as well as particle filtering schemes for multi-node networks, are briefly outlined here.
Initial attempts to distribute the centralized KF recursions (see Olfati_Kalman and references in sgrr08tsp ) rely on consensus averaging Consensus_Averaging . The idea is to estimate across nodes those sufficient statistics (expressible in terms of network-wide averages) required to form the corrected state and the corresponding corrected state error covariance matrix. Clearly, there is an inherent delay in obtaining these estimates, confining the operation of such schemes to applications with slowly-varying state vectors and/or fast communications, as needed to complete multiple consensus iterations within the time interval separating the acquisition of consecutive measurements. Other issues that may lead to instability in existing decentralized KF approaches are detailed in sgrr08tsp .
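The consensus-averaging primitive underlying these schemes can be sketched as follows: each node repeatedly mixes its local statistic with those of its neighbors, converging to the network-wide average needed by the filter. The ring topology and step-size below are illustrative assumptions.

```python
import numpy as np

# Consensus averaging on an 8-node ring: repeated local mixing drives
# every node's value to the network-wide mean of the initial statistics.
rng = np.random.default_rng(0)
J = 8
x = rng.normal(size=J)          # one local statistic per node
target = x.mean()               # the network-wide average being sought
eps = 0.3                       # step-size; eps < 1/deg_max suffices here

for _ in range(200):
    x_new = x.copy()
    for j in range(J):
        for i in ((j - 1) % J, (j + 1) % J):   # ring neighbors of node j
            x_new[j] += eps * (x[i] - x[j])
    x = x_new
```

The need for many such inner iterations between successive measurements is precisely the delay alluded to above, which motivates the fixed-lag smoothing alternative discussed next.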
The delay incurred by those inner-loop consensus iterations motivated the consideration of fixed-lag decentralized Kalman smoothing (KS), instead of filtering, in sgrr08tsp . Matching consensus iterations with the time instants of data acquisition, fixed-lag smoothers allow sensors to form MMSE-optimal local smoothed estimates, which take advantage of all measurements acquired within the “waiting period.” The ADMM-enabled decentralized KS in sgrr08tsp also overcomes the noise-related limitations of consensus-averaging algorithms xbk07jpdc . In the presence of communication noise, these estimates converge in the mean sense, while their noise-induced variance remains bounded. This noise resiliency allows sensors to exchange quantized data, further lowering the communication cost. For a tutorial treatment of decentralized Kalman filtering approaches using WSNs (including the decentralized ADMM-based KS of sgrr08tsp and strategies to reduce the communication cost of state estimation problems), the interested reader is referred to dkf_control_mag . These reduced-cost strategies exploit the redundancy in the information provided by individual observations collected at different sensors, different observations collected at different sensors, and different observations acquired at the same sensor.
On a related note, a collaborative algorithm is developed in cg_cartography to estimate the channel gains of wireless links in a geographical area. Kriged Kalman filtering (KKF) ripley , which is a tool with widely appreciated merits in spatial statistics and geosciences, is adopted and implemented in a decentralized fashion leveraging the ADMM framework described here. The distributed KKF algorithm requires only local message passing to track the time-variant so-termed “shadowing field” using a network of radiometers, yet it provides a global view of the radio frequency (RF) environment through consensus iterations; see also Section 5.3 for further elaboration on spectrum sensing carried out via wireless cognitive radio networks.
To wrap up the discussion, consider a network of collaborating agents (e.g., robots) equipped with wireless sensors measuring distance and/or bearing to a target that they wish to track. Even if state models are available, the nonlinearities present in these measurements prevent sensors from employing the clairvoyant (linear) Kalman tracker discussed so far. In response to these challenges, dpf develops a set-membership constrained particle filter (PF) approach that: (i) exhibits performance comparable to the centralized PF; (ii) requires only communication of particle weights among neighboring sensors; and (iii) affords both consensus-based and incremental averaging implementations. Affordable inter-sensor communications are enabled through a novel distributed adaptation scheme, which considerably reduces the number of particles needed to achieve a given performance. The interested reader is referred to dpf_tutorial for a recent tutorial account of decentralized PF in multi-agent networks.
5 Decentralized Sparsity-regularized Rank Minimization
Modern network data sets typically involve a large number of attributes. This fact motivates predictive models offering a sparse (broadly meaning parsimonious) representation in terms of a few attributes. Such low-dimensional models facilitate interpretability and enhanced predictive performance. In this context, this section deals with ADMM-based decentralized algorithms for sparsity-regularized rank minimization. It is argued that such algorithms are key to unveiling Internet traffic anomalies, given ubiquitous link-load measurements. Moreover, the notion of RF cartography is subsequently introduced to exemplify the development of a paradigm infrastructure for situational awareness at the physical layer of wireless cognitive radio (CR) networks. A (subsumed) decentralized sparse linear regression algorithm is outlined to accomplish the aforementioned cartography task.
5.1 Network Anomaly Detection Via Sparsity and Low Rank
Consider a backbone IP network, whose abstraction is a graph of nodes (routers) connected by physical links. The operational goal of the network is to transport a set of origin-destination (OD) traffic flows associated with specific OD (ingress-egress router) pairs. The traffic volume (in bytes or packets) passing through each link over a fixed time interval is collected across the entire network in a vector of link counts, e.g., using the ubiquitous SNMP protocol. Single-path routing is adopted here, meaning a given flow’s traffic is carried along a single path, through the multiple links connecting the corresponding source-destination pair. Accordingly, over a discrete time horizon the measured link counts and the (unobservable) OD flow traffic matrix are related through a so-termed routing matrix lakhina , whose entries take the value one if the given link carries the corresponding flow, and zero otherwise. The routing matrix is ‘wide,’ as for backbone networks the number of OD flows is much larger than the number of physical links. A cardinal property of the traffic matrix is noteworthy: common temporal patterns across OD traffic flows, in addition to their almost periodic behavior, render most rows (respectively columns) of the traffic matrix linearly dependent, so the traffic matrix typically has low rank. This intuitive property has been extensively validated with real network data; see Fig. 7 and e.g., lakhina .
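The low-rank structure just described can be illustrated with a toy synthetic example, in which a rank-deficient traffic matrix is observed through a wide binary routing matrix; all dimensions and the rank are illustrative.

```python
import numpy as np

# Toy model: low-rank OD traffic X observed through a wide binary
# routing matrix R; the link-count matrix Y inherits the low rank.
rng = np.random.default_rng(0)
L, F, T, rho = 10, 40, 200, 3          # links, flows, time slots, rank
R = (rng.random((L, F)) < 0.3).astype(float)   # 1 if a link carries a flow

# common temporal patterns across flows => X has rank rho << min(F, T)
U = np.abs(rng.normal(size=(F, rho)))  # nonnegative flow "signatures"
V = np.abs(rng.normal(size=(rho, T)))  # shared temporal patterns
X = U @ V                              # OD flow traffic matrix
Y = R @ X                              # measured link counts Y = R X
```

Since the rank of a product never exceeds that of its factors, the observed link counts Y are also (at most) rank rho, which is what low-rank recovery methods exploit.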
It is not uncommon for some of the OD flow rates to experience unexpected abrupt changes. These so-termed traffic volume anomalies are typically due to (unintentional) network equipment misconfiguration or outright failure, unforeseen behaviors following routing policy modifications, or cyber attacks (e.g., DoS attacks) which aim at compromising the services offered by the network zggr05 ; lakhina ; mrspmag13 . The unknown amount of anomalous traffic per flow and time instant is what one wishes to estimate. Explicitly accounting for the presence of anomalous flows, the measured traffic carried by each link then also comprises the anomalies routed through it, plus noise variables capturing measurement errors and unmodeled dynamics. Traffic volume anomalies are (unsigned) sudden changes in the traffic of OD flows, and as such their effect can span multiple links in the network. A key difficulty in unveiling anomalies from link-level measurements only is that, oftentimes, clearly discernible anomalous spikes in the flow traffic can be masked through “destructive interference” of the superimposed OD flows