Decentralized learning for wireless communications and networking

03/30/2015 · Georgios B. Giannakis, et al. · The University of Texas at Arlington · University of Minnesota · University of Illinois at Urbana-Champaign · University of Rochester · USTC

This chapter deals with decentralized learning algorithms for in-network processing of graph-valued data. A generic learning problem is formulated and recast into a separable form, which is iteratively minimized using the alternating-direction method of multipliers (ADMM) so as to gain the desired degree of parallelization. Without exchanging elements from the distributed training sets, and while keeping inter-node communications at affordable levels, the local (per-node) learners consent to the desired global quantity, meaning the one that would be obtained if the entire training data set were centrally available. The impact of the decentralized learning framework on contemporary wireless communications and networking tasks is illustrated through case studies including target tracking using wireless sensor networks, unveiling Internet traffic anomalies, power system state estimation, as well as spectrum cartography for wireless cognitive radio networks.

1 Introduction

This chapter puts forth an optimization framework for learning over networks, which entails decentralized processing of training data acquired by interconnected nodes. Such an approach is of paramount importance when communication of training data to a central processing unit is prohibited due to, e.g., communication cost or privacy concerns. The so-termed in-network processing paradigm for decentralized learning is based on successive refinements of local model parameter estimates maintained at individual network nodes. In a nutshell, each iteration of this broad class of fully decentralized algorithms comprises: (i) a communication step where nodes exchange information with their neighbors through, e.g., the shared wireless medium or the Internet backbone; and (ii) an update step where each node uses this information to refine its local estimate. Devoid of hierarchy, these in-network processing schemes ensure that the local estimators eventually consent to the global estimator sought, while fully exploiting existing spatiotemporal correlations to maximize estimation performance. In most cases, consensus is formally attained only asymptotically in time; however, a finite number of iterations suffices to obtain results that are accurate enough for all practical purposes.

In this context, the approach followed here entails reformulating a generic learning task as a convex constrained optimization problem, whose structure lends itself naturally to decentralized implementation over a network graph. It is then possible to capitalize on this favorable structure by resorting to the alternating-direction method of multipliers (ADMM), an iterative optimization method that can be traced back to Glowinski_Marrocco_ADMM_1975 (see also Gabay_Mercier_ADMM_1976 ), and which is especially well-suited for parallel processing bertsi97book ; Boyd_ADMM . This way, simple decentralized recursions become available to update each node's local estimate, as well as a vector of dual prices through which network-wide agreement is effected.

Problem statement. Consider a network of $J$ nodes in which scarcity of power and bandwidth resources encourages only single-hop inter-node communications, such that the $j$-th node communicates solely with nodes in its single-hop neighborhood $\mathcal{N}_j$. Inter-node links are assumed symmetric, and the network is modeled as an undirected graph whose vertices are the nodes and whose edges represent the available communication links. As will become clear through the different application domains studied here, nodes could be wireless sensors, wireless access points (APs), electrical buses, sensing cognitive radios, or routers, to name a few examples. Node $j$ acquires measurements stacked in the vector $\mathbf{y}_j$ containing information about the unknown model parameters in $\mathbf{s}$, which the nodes need to estimate. Let $\mathbf{y} := [\mathbf{y}_1^\top, \ldots, \mathbf{y}_J^\top]^\top$ collect the measurements acquired across the entire network. Many popular centralized schemes obtain an estimate $\hat{\mathbf{s}}$ as follows:

$$\hat{\mathbf{s}} := \arg\min_{\mathbf{s}} \sum_{j=1}^{J} f_j(\mathbf{s}; \mathbf{y}_j) \qquad (1)$$

In the decentralized learning problem studied here though, the summands $f_j(\mathbf{s}; \mathbf{y}_j)$ are local cost functions known only to node $j$. Sharing this information with a centralized processor, also referred to as a fusion center (FC), can be challenging in various applications of interest, or may even be impossible in, e.g., wireless sensor networks (WSNs) operating under stringent power budget constraints. In other cases, such as the Internet or collaborative healthcare studies, agents may not be willing to share their private training data but only the learning results. Performing the optimization (1) in a centralized fashion raises robustness concerns as well, since the central processor represents an isolated point of failure.

In this context, the objective of this chapter is to develop a decentralized algorithmic framework for learning tasks, based on in-network processing of the locally available data. The described setup naturally suggests three characteristics that the algorithms should exhibit: c1) each node $j$ should obtain an estimate of $\mathbf{s}$ that coincides with the corresponding solution of the centralized estimator (1), which uses the entire data $\mathbf{y}$; c2) processing per node should be kept as simple as possible; and c3) the overhead for inter-node communications should be affordable and confined to single-hop neighborhoods. It will be argued that such an ADMM-based algorithmic framework can be useful for contemporary applications in the domain of wireless communications and networking.

Prior art. Existing decentralized solvers of (1) can be classified in two categories: C1) those obtained by modifying centralized algorithms and operating in the primal domain; and C2) those handling an equivalent constrained form of (1) (see (2) in Section 2), and operating in the primal-dual domain.

Primal-domain algorithms under C1 include the (sub)gradient method and its variants Nedic2009 ; Ram2010 ; Yuan2013 ; Jakovetic2013 , the incremental gradient method Rabbat2005-inc , the proximal gradient method Chen2012 , and the dual averaging method Duchi2012 ; Tsianos2012-acc . In these methods, each node averages its local iterate with those of its neighbors and descends along its local negative (sub)gradient direction. However, the resultant algorithms are limited to inexact convergence when constant stepsizes are used Nedic2009 ; Yuan2013 . If diminishing stepsizes are employed instead, the algorithms can achieve exact convergence at the price of slower convergence Jakovetic2013 ; Rabbat2005-inc ; Duchi2012 . A constant-stepsize exact first-order algorithm is also available, which achieves fast and exact convergence by correcting error terms in the distributed gradient iteration using two-step historic information Shi2014-extra .

Primal-dual domain algorithms under C2 solve an equivalent constrained form of (1), and thus drive local solutions to reach global optimality. The dual decomposition method is applicable here because (sub)gradients of the dual function depend on local and neighboring iterates only, and can thus be computed without global cooperation Rabbat2005 . ADMM modifies dual decomposition by regularizing the constraints with a quadratic term, which improves numerical stability as well as the rate of convergence, as will be demonstrated later in this chapter. Per ADMM iteration, each node solves a subproblem that can be computationally demanding. Fortunately, these subproblems can be solved inexactly by running one-step gradient or proximal gradient descent iterations, which markedly mitigates the computational burden Ling2014-icassp ; Chang2014-icassp . A sequential distributed ADMM algorithm can be found in Wei2012 .

Chapter outline. The remainder of this chapter is organized as follows. Section 2 describes a generic ADMM framework for decentralized learning over networks, which is at the heart of all algorithms described in the chapter and was pioneered in sg06asilomar ; srg08tsp for in-network estimation using WSNs. Section 3 focuses on batch estimation as well as (un)supervised inference, while Section 4 deals with decentralized adaptive estimation and tracking schemes where network nodes collect data sequentially in time. Internet traffic anomaly detection and spectrum cartography for wireless CR networks serve as motivating applications for the sparsity-regularized rank minimization algorithms developed in Section 5. Fundamental results on the convergence and convergence rate of decentralized ADMM are stated in Section 6.

2 In-Network Learning with ADMM in a Nutshell

Since the local summands in (1) are coupled through the global variable $\mathbf{s}$, it is not straightforward to decompose the unconstrained optimization problem in (1). To overcome this hurdle, the key idea is to introduce local variables $\{\mathbf{s}_j\}_{j=1}^J$ representing local estimates of $\mathbf{s}$ per network node $j$ sg06asilomar ; srg08tsp . Accordingly, one can formulate the constrained minimization problem

$$\{\hat{\mathbf{s}}_j\}_{j=1}^{J} := \arg\min_{\{\mathbf{s}_j\}} \sum_{j=1}^{J} f_j(\mathbf{s}_j; \mathbf{y}_j), \quad \text{s. to } \; \mathbf{s}_j = \mathbf{s}_{j'}, \;\; j' \in \mathcal{N}_j, \; j = 1, \ldots, J. \qquad (2)$$

The “consensus” equality constraints in (2) ensure that local estimates coincide within neighborhoods. Further, if the graph is connected, then consensus naturally extends to the whole network, and it turns out that problems (1) and (2) are equivalent in the sense that $\hat{\mathbf{s}}_j = \hat{\mathbf{s}}$ for all $j$ srg08tsp . Interestingly, the formulation in (2) exhibits a separable structure that is amenable to decentralized minimization. To leverage this favorable structure, the alternating direction method of multipliers (ADMM), see e.g., (bertsi97book, pp. 253-261), can be employed here to minimize (2) in a decentralized fashion. This procedure will yield a distributed estimation algorithm whereby the local iterates $\mathbf{s}_j(k)$, with $k$ denoting iterations, provably converge to the centralized estimate $\hat{\mathbf{s}}$ in (1); see also Section 6.

To facilitate application of ADMM, consider the auxiliary variables $\{\mathbf{z}_j^{j'}\}$, and reparameterize the constraints in (2) with the equivalent ones

$$\text{s. to } \; \mathbf{s}_j = \mathbf{z}_j^{j'} \;\text{ and }\; \mathbf{s}_{j'} = \mathbf{z}_j^{j'}, \quad j = 1, \ldots, J, \; j' \in \mathcal{N}_j. \qquad (3)$$

The variables $\{\mathbf{z}_j^{j'}\}$ are only used to derive the local recursions and will eventually be eliminated. Attaching Lagrange multipliers $\{\mathbf{v}_j^{j'}, \bar{\mathbf{v}}_j^{j'}\}$ to the constraints (3), consider the augmented Lagrangian function

$$\mathcal{L}_a\big[\{\mathbf{s}_j\}, \{\mathbf{z}_j^{j'}\}, \{\mathbf{v}_j^{j'}, \bar{\mathbf{v}}_j^{j'}\}\big] = \sum_{j=1}^{J} f_j(\mathbf{s}_j; \mathbf{y}_j) + \sum_{j=1}^{J} \sum_{j' \in \mathcal{N}_j} \left[ (\mathbf{v}_j^{j'})^\top (\mathbf{s}_j - \mathbf{z}_j^{j'}) + (\bar{\mathbf{v}}_j^{j'})^\top (\mathbf{s}_{j'} - \mathbf{z}_j^{j'}) \right] + \frac{c}{2} \sum_{j=1}^{J} \sum_{j' \in \mathcal{N}_j} \left[ \|\mathbf{s}_j - \mathbf{z}_j^{j'}\|^2 + \|\mathbf{s}_{j'} - \mathbf{z}_j^{j'}\|^2 \right] \qquad (4)$$

where the constant $c > 0$ is a penalty coefficient. To minimize (2), ADMM entails an iterative procedure comprising three steps per iteration $k$:

[S1] Multiplier updates:

$$\mathbf{v}_j^{j'}(k) = \mathbf{v}_j^{j'}(k-1) + c\,[\mathbf{s}_j(k) - \mathbf{z}_j^{j'}(k)]$$
$$\bar{\mathbf{v}}_j^{j'}(k) = \bar{\mathbf{v}}_j^{j'}(k-1) + c\,[\mathbf{s}_{j'}(k) - \mathbf{z}_j^{j'}(k)]$$

[S2] Local estimate updates:

$$\{\mathbf{s}_j(k+1)\} = \arg\min_{\{\mathbf{s}_j\}} \mathcal{L}_a\big[\{\mathbf{s}_j\}, \{\mathbf{z}_j^{j'}(k)\}, \{\mathbf{v}_j^{j'}(k), \bar{\mathbf{v}}_j^{j'}(k)\}\big]$$

[S3] Auxiliary variable updates:

$$\{\mathbf{z}_j^{j'}(k+1)\} = \arg\min_{\{\mathbf{z}_j^{j'}\}} \mathcal{L}_a\big[\{\mathbf{s}_j(k+1)\}, \{\mathbf{z}_j^{j'}\}, \{\mathbf{v}_j^{j'}(k), \bar{\mathbf{v}}_j^{j'}(k)\}\big]$$

where $j = 1, \ldots, J$ and $j' \in \mathcal{N}_j$ in [S1]. Reformulating the generic learning problem (1) as (2) renders the augmented Lagrangian in (4) highly decomposable. The separability comes in two flavors: with respect to both sets $\{\mathbf{s}_j\}$ and $\{\mathbf{z}_j^{j'}\}$ of primal variables, as well as across nodes $j$. This in turn leads to highly parallelized, simplified recursions corresponding to the aforementioned steps [S1]-[S3]. Specifically, as detailed in e.g., srg08tsp ; sgrr08tsp ; SMG_D_LMS ; pfacgg10jmlr ; mateos_dlasso ; mmg13tsp , it follows that if the multipliers are initialized to zero, the ADMM-based decentralized algorithm reduces to the following updates carried out locally at every node.

In-network learning algorithm at node $j$, for $k = 0, 1, \ldots$:

$$\mathbf{v}_j(k) = \mathbf{v}_j(k-1) + c \sum_{j' \in \mathcal{N}_j} \left[ \mathbf{s}_j(k) - \mathbf{s}_{j'}(k) \right] \qquad (5)$$
$$\mathbf{s}_j(k+1) = \arg\min_{\mathbf{s}_j} \left[ f_j(\mathbf{s}_j; \mathbf{y}_j) + \mathbf{v}_j^\top(k)\, \mathbf{s}_j + c \sum_{j' \in \mathcal{N}_j} \left\| \mathbf{s}_j - \frac{\mathbf{s}_j(k) + \mathbf{s}_{j'}(k)}{2} \right\|^2 \right] \qquad (6)$$

where $\mathbf{v}_j(k)$ is the (scaled) sum of the Lagrange multipliers associated with node $j$'s consensus constraints, and all initial values are set to zero.

Recursions (5) and (6) entail only local updates, and comprise the general-purpose ADMM-based decentralized learning algorithm. The inherently redundant set of auxiliary variables $\{\mathbf{z}_j^{j'}\}$ and the corresponding multipliers have been eliminated. Each node, say the $j$-th one, does not need to separately keep track of all its non-redundant multipliers $\{\mathbf{v}_j^{j'}, \bar{\mathbf{v}}_j^{j'}\}$, but only to update the (scaled) sum $\mathbf{v}_j(k)$. In the end, node $j$ has to store and update only two vectors of the dimension of $\mathbf{s}$, namely $\mathbf{s}_j(k)$ and $\mathbf{v}_j(k)$. A unique feature of in-network processing is that nodes communicate their updated local estimates $\mathbf{s}_j(k)$ (and not their raw data $\mathbf{y}_j$) with their neighbors, in order to carry out the tasks (5)-(6) for the next iteration.

As elaborated in Section 6, under mild assumptions on the local costs one can establish that $\lim_{k \to \infty} \mathbf{s}_j(k) = \hat{\mathbf{s}}$, for $j = 1, \ldots, J$. As a result, the algorithm asymptotically attains consensus and the performance of the centralized estimator [cf. (1)].
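To make the recursions (5)-(6) concrete, the following minimal sketch simulates them for quadratic local costs $f_j(\mathbf{s}; \mathbf{y}_j) = \|\mathbf{y}_j - \mathbf{H}_j \mathbf{s}\|^2$, for which the update (6) admits a closed form [cf. the D-BLUE recursion (7) in Section 3.1]. The function names, the synchronous-update simulation, and all numerical values are illustrative assumptions, not part of the original algorithms.

```python
import numpy as np

def in_network_admm(H, y, neighbors, c=1.0, iters=100):
    """Simulate ADMM recursions (5)-(6) for local costs f_j(s) = ||y_j - H_j s||^2.

    H, y      : lists of per-node fitting matrices H_j and measurements y_j
    neighbors : list of lists; neighbors[j] holds the single-hop neighbors of node j
    Returns the per-node iterates s_j(k) after `iters` iterations.
    """
    J, p = len(H), H[0].shape[1]
    s = [np.zeros(p) for _ in range(J)]        # local estimates s_j(0) = 0
    v = [np.zeros(p) for _ in range(J)]        # multiplier sums v_j(-1) = 0
    for _ in range(iters):
        s_prev = [sj.copy() for sj in s]
        for j in range(J):
            # Multiplier update (5): v_j += c * sum_{j' in N_j} (s_j - s_j')
            v[j] = v[j] + c * sum(s_prev[j] - s_prev[i] for i in neighbors[j])
            # Estimate update (6) in closed form: setting the gradient
            # 2 H_j^T (H_j s - y_j) + v_j + c sum (2s - s_j(k) - s_j'(k)) to zero.
            A = 2 * H[j].T @ H[j] + 2 * c * len(neighbors[j]) * np.eye(p)
            b = 2 * H[j].T @ y[j] - v[j] \
                + c * sum(s_prev[j] + s_prev[i] for i in neighbors[j])
            s[j] = np.linalg.solve(A, b)
    return s

# Toy example: 4 nodes on a ring, common unknown s of dimension p = 3.
rng = np.random.default_rng(0)
p, J = 3, 4
s_true = rng.standard_normal(p)
H = [rng.standard_normal((5, p)) for _ in range(J)]
y = [Hj @ s_true + 0.01 * rng.standard_normal(5) for Hj in H]
neighbors = [[1, 3], [0, 2], [1, 3], [0, 2]]
s_hat = in_network_admm(H, y, neighbors, c=0.5, iters=200)
centralized = np.linalg.lstsq(np.vstack(H), np.concatenate(y), rcond=None)[0]
print(np.linalg.norm(s_hat[0] - centralized))  # consensus error should be small
```

Note that only the iterates $\mathbf{s}_j(k)$ of the neighbors enter each node's update, consistent with the single-hop message passing described above.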

3 Batch In-Network Estimation and Inference

3.1 Decentralized Signal Parameter Estimation

Many workhorse estimation schemes such as maximum likelihood estimation (MLE), least-squares estimation (LSE), best linear unbiased estimation (BLUE), as well as linear minimum mean-square error estimation (LMMSE) and maximum a posteriori (MAP) estimation, can all be formulated as a minimization task similar to (1); see e.g., Estimation_Theory . However, the corresponding centralized estimation algorithms fall short in settings where both the acquired measurements and computational capabilities are distributed among multiple spatially scattered sensing nodes, which is the case with WSNs. Here we outline a novel batch decentralized optimization framework building on the ideas in Section 2, which formulates the desired estimator as the solution of a separable constrained convex minimization problem tackled via ADMM; see e.g., bertsi97book ; Boyd_ADMM ; srg08tsp ; sgrr08tsp for further details on the algorithms outlined here.

Depending on the estimation technique utilized, the local cost functions $f_j$ in (1) should be chosen accordingly; see e.g., Estimation_Theory ; srg08tsp ; sgrr08tsp . For instance, when $\mathbf{s}$ is assumed to be an unknown deterministic vector, then:

  • If (1) corresponds to the centralized MLE, then $f_j(\mathbf{s}; \mathbf{y}_j)$ is the negative log-likelihood capturing the probability density function (pdf) of the data $\mathbf{y}_j$, while the network-wide data $\{\mathbf{y}_j\}_{j=1}^J$ are assumed statistically independent.

  • If (1) corresponds to the BLUE (or weighted least-squares estimator), then $f_j(\mathbf{s}; \mathbf{y}_j) = \frac{1}{2}\|\boldsymbol{\Sigma}_j^{-1/2}(\mathbf{y}_j - \mathbf{H}_j \mathbf{s})\|^2$, where $\boldsymbol{\Sigma}_j$ denotes the covariance of the data $\mathbf{y}_j$, and $\mathbf{H}_j$ is a known fitting matrix.

When is treated as a random vector, then:

  • If (1) corresponds to the centralized MAP estimator, then $f_j(\mathbf{s}; \mathbf{y}_j)$ accounts for the local data pdf $p(\mathbf{y}_j | \mathbf{s})$ and for (a fraction of) the prior pdf of $\mathbf{s}$, while the data are assumed conditionally independent given $\mathbf{s}$.

  • If (1) corresponds to the centralized LMMSE, then the local costs involve the cross-covariance $\boldsymbol{\Sigma}_{\mathbf{s}\mathbf{y}_j}$ of $\mathbf{s}$ with $\mathbf{y}_j$, where $\mathbf{y}_j$ stands for the $j$-th block subvector of $\mathbf{y}$.

Substituting in (6) the specific $f_j$ for each of the aforementioned estimation tasks yields a family of batch ADMM-based decentralized estimation algorithms. The decentralized BLUE algorithm is described in this section as an example of decentralized linear estimation.

Recent advances in cyber-physical systems have also stressed the need for decentralized nonlinear least-squares (LS) estimation. Monitoring the power grid, for instance, is challenged by the nonconvexity arising from the nonlinear AC power flow model; see e.g., (Wollenberg-book, Ch. 4), while the interconnection across local transmission systems motivates their operators to collaboratively monitor the global system state. Interestingly, this nonlinear (specifically quadratic) estimation task can be convexified to a semidefinite program (SDP) (Boyd_Convex, p. 168), which can be solved in a decentralized fashion by leveraging the batch ADMM; see also Wen2010 for an ADMM-based centralized SDP precursor.

Decentralized BLUE

The minimization involved in (6) can be performed locally at sensor $j$ by employing numerical optimization techniques Boyd_Convex . There are cases where the minimization in (6) yields a closed-form and easy-to-implement updating formula for $\mathbf{s}_j(k+1)$. If, for example, network nodes wish to find the BLUE estimator in a distributed fashion, the local cost is $f_j(\mathbf{s}; \mathbf{y}_j) = \frac{1}{2}\|\boldsymbol{\Sigma}_j^{-1/2}(\mathbf{y}_j - \mathbf{H}_j \mathbf{s})\|^2$, and (6) becomes a strictly convex unconstrained quadratic program which admits the following closed-form solution (see details in srg08tsp ; MSG_D_RLS )

$$\mathbf{s}_j(k+1) = \left[ \mathbf{H}_j^\top \boldsymbol{\Sigma}_j^{-1} \mathbf{H}_j + 2c|\mathcal{N}_j|\, \mathbf{I} \right]^{-1} \left[ \mathbf{H}_j^\top \boldsymbol{\Sigma}_j^{-1} \mathbf{y}_j - \mathbf{v}_j(k) + c \sum_{j' \in \mathcal{N}_j} \left( \mathbf{s}_j(k) + \mathbf{s}_{j'}(k) \right) \right] \qquad (7)$$

The pair (5) and (7) comprises the decentralized (D-) BLUE algorithm sg06asilomar ; srg08tsp . For the special case where each node acquires unit-variance scalar observations $y_j$, there is no fitting matrix and the unknown $s$ is scalar; D-BLUE then offers a decentralized algorithm to obtain the network-wide sample average $\bar{y} = J^{-1} \sum_{j=1}^{J} y_j$. The update rule for the local estimate is obtained by suitably specializing (7) to

$$s_j(k+1) = \frac{1}{1 + 2c|\mathcal{N}_j|} \left[ y_j - v_j(k) + c \sum_{j' \in \mathcal{N}_j} \left( s_j(k) + s_{j'}(k) \right) \right]. \qquad (8)$$

Different from existing distributed averaging approaches Barbarossa_Scu_Coupled_Osci ; dimakis10 ; Consensus_Averaging ; xbk07jpdc , the ADMM-based one originally proposed in sg06asilomar ; srg08tsp allows the decentralized computation of general nonlinear estimators that may not be available in closed form and cannot be expressed as “averages.” Further, the obtained recursions exhibit robustness in the presence of additive noise in the inter-node communication links.

Decentralized SDP

Consider now that each scalar entry of $\mathbf{y}_j$ adheres to a quadratic measurement model in $\mathbf{s}$ plus additive Gaussian noise, in which case the centralized MLE requires solving a nonlinear least-squares problem. To tackle the nonconvexity due to the quadratic dependence, the task of estimating the state $\mathbf{s}$ can be reformulated as that of estimating the outer-product matrix $\mathbf{S} := \mathbf{s}\mathbf{s}^\top$. In this reformulation, each measurement becomes a linear function of $\mathbf{S}$ hzgg_jstsp14 . Motivated by the separable structure in (2), the nonlinear estimation problem can be similarly formulated as

$$\{\hat{\mathbf{S}}_j\} := \arg\min_{\{\mathbf{S}_j\}} \sum_{j=1}^{J} f_j(\mathbf{S}_j; \mathbf{y}_j) \quad \text{s. to } \; \mathbf{S}_j = \mathbf{S}_{j'}, \; j' \in \mathcal{N}_j; \;\; \mathbf{S}_j \succeq \mathbf{0}, \; \mathrm{rank}(\mathbf{S}_j) = 1, \; j = 1, \ldots, J \qquad (9)$$

where the positive-semidefiniteness and rank constraints ensure that each matrix $\mathbf{S}_j$ is an outer-product matrix. By dropping the non-convex rank constraints, problem (9) becomes a convex semidefinite program (SDP), which can be solved in a decentralized fashion by adopting the batch ADMM iterations (5) and (6).
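As an illustration of the convexification step, the snippet below solves a single node's rank-relaxed subproblem with the generic modeling package cvxpy. The quadratic measurement matrices, their values, and the least-squares fitting cost are illustrative placeholders, not the exact estimator of hzgg_jstsp14 .

```python
import cvxpy as cp
import numpy as np

# Hypothetical quadratic measurements: y_i = s^T A_i s + noise = trace(A_i S) + noise.
rng = np.random.default_rng(1)
n, m = 4, 12
s_true = rng.standard_normal(n)
A = [0.5 * (M + M.T) for M in rng.standard_normal((m, n, n))]   # symmetric A_i
y = np.array([s_true @ Ai @ s_true for Ai in A]) + 0.01 * rng.standard_normal(m)

# Rank-relaxed SDP: drop rank(S) = 1, keep S positive semidefinite.
S = cp.Variable((n, n), PSD=True)
cost = cp.sum_squares(cp.hstack([cp.trace(A[i] @ S) for i in range(m)]) - y)
cp.Problem(cp.Minimize(cost)).solve()

# Recover a rank-one estimate of s from the leading eigenvector of S.
w, V = np.linalg.eigh(S.value)
s_hat = np.sqrt(max(w[-1], 0)) * V[:, -1]
print(min(np.linalg.norm(s_hat - s_true), np.linalg.norm(s_hat + s_true)))
```

The sign ambiguity in the last line is inherent to recovering $\mathbf{s}$ from $\mathbf{S} = \mathbf{s}\mathbf{s}^\top$.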

This decentralized SDP approach has been successfully employed for monitoring large-scale power networks gg_spmag13 . To estimate the complex voltage phasors at all nodes (a.k.a. the power system state), measurements are collected on real/reactive power and voltage magnitude, all of which depend quadratically on the unknown states. Gauss-Newton iterations have been the ‘workhorse’ tool for this nonlinear estimation problem; see e.g., SE_book ; Wollenberg-book . However, the iterative linearization therein can suffer from convergence issues and local optimality, especially due to the increasing variability in power grids with high penetration of renewables. With improved communication capabilities, decentralized state estimation among multiple control centers has attracted growing interest; see Fig. 1, illustrating three interconnected areas aiming to achieve the centralized estimation performance collaboratively.

Figure 1: (Left:) Schematic of collaborative power system state estimation among control centers of three interconnected networks (IEEE 118-bus test case). (Right:) Local state estimation error vs. iteration number using the decentralized SDP-based state estimation method.

A decentralized SDP-based state estimator has been developed in hzgg_jstsp14 with reduced complexity compared to (9). The resultant algorithm involves only the internal voltages and those of next-hop neighbors in each local matrix; e.g., the local matrix for each area in Fig. 1 is identified by the dashed lines. Interestingly, the positive-semidefiniteness constraint for the overall matrix decouples nicely into that of all local ones, and the estimation error converges to the centralized performance within only a dozen iterations. The decentralized SDP framework has successfully addressed a variety of power system operational challenges, including a distributed microgrid optimal power flow solver in edhzgg_tsg13 ; see also gg_spmag13 for a tutorial overview of these applications.

3.2 Decentralized Inference

Along with decentralized signal parameter estimation, a variety of inference tasks become possible by relying on the collaborative sensing and computations performed by networked nodes. In the special context of resource-constrained WSNs deployed to determine the common messages broadcast by a wireless AP, the relatively limited reception capability per node makes it desirable to design a decentralized detection scheme through which all sensors attain sufficient statistics for the global problem. Another exciting application of WSNs is environmental monitoring for, e.g., inferring the presence or absence of a pollutant over a geographical area. Limited by local sensing capabilities, it is important to develop a decentralized learning framework such that all sensors can collaboratively approach the performance attained as if the network-wide data had been available everywhere (or at an FC for that matter). Given the diverse inference tasks, the challenge becomes designing inter-node information exchange schemes that allow for minimal communication and computation overhead in specific applications.

Decentralized Detection

Message decoding. A decentralized detection framework is introduced here for the message decoding task, which is relevant for diverse wireless communications and networking scenarios. Consider an AP broadcasting a coded block $\mathbf{t}$ to a network of sensors, all of which know the codebook $\mathcal{C}$ that $\mathbf{t}$ belongs to. For simplicity, assume binary codewords, and that each node receives a same-length block of symbols through a discrete, memoryless, symmetric channel that is conditionally independent across sensors. Sensor $j$ knows its local channel from the AP, as characterized by the per-bit conditional pdf. Due to conceivably low signal-to-noise-ratio (SNR) conditions, each low-cost sensor may be unable to reliably decode the message. Accordingly, the need arises for information exchanges among single-hop neighboring sensors to achieve the global (that is, centralized) error performance. Given the local likelihoods per sensor, the assumption of memoryless and independent channels yields the centralized maximum-likelihood (ML) decoder as

$$\hat{\mathbf{t}}_{\mathrm{ML}} = \arg\max_{\mathbf{t} \in \mathcal{C}} \sum_{j=1}^{J} \ln p(\mathbf{y}_j \mid \mathbf{t}). \qquad (10)$$

ML decoding amounts to deciding the most likely codeword among multiple candidate ones and, in this sense, can be viewed as a test of multiple hypotheses. In this general context, belief propagation approaches have been developed in sas06tsp , so that all nodes can cooperate to learn the centralized likelihood per hypothesis. However, even for linear binary block codes, the number of hypotheses, namely the cardinality of $\mathcal{C}$, grows exponentially with the codeword length. This introduces a high communication and computation burden for low-cost sensor designs.

The key here is to extract minimal sufficient statistics for the centralized decoding problem. For binary codes, each log-likelihood term in (10) decomposes per bit, up to constants, as $\ln p(\mathbf{y}_j \mid \mathbf{t}) = \sum_{w} \left[ c_{j,w} + t_w\, \gamma_{j,w} \right]$, where

$$\gamma_{j,w} := \ln \frac{p(y_{j,w} \mid t_w = 1)}{p(y_{j,w} \mid t_w = 0)} \qquad (11)$$

is the local log-likelihood ratio (LLR) for the $w$-th bit at sensor $j$. Ignoring all constant terms $c_{j,w}$, the ML decoding objective ends up depending only on the sums $\sum_{j=1}^{J} \gamma_{j,w}$ of local LLRs. Clearly, the sufficient statistic for solving (10) is the sum of all local LLR terms, or equivalently, the network-wide average $J^{-1}\sum_{j=1}^{J} \gamma_{j,w}$ for each bit $w$. Interestingly, this average is one instance of the BLUE discussed in Section 3.1 with unit-variance scalar data, since

$$\frac{1}{J} \sum_{j=1}^{J} \gamma_{j,w} = \arg\min_{s} \sum_{j=1}^{J} \left( \gamma_{j,w} - s \right)^2. \qquad (12)$$

This way, the ADMM-based decentralized learning framework in Section 2 allows all sensors to collaboratively attain the sufficient statistics for the decoding problem (10) via in-network processing. Each sensor only needs to estimate a vector of LLR averages whose length equals that of the codeword, which bypasses the exponential complexity incurred under the belief propagation framework. As shown in hzggac08tsp , decentralized soft decoding is also feasible, since the a posteriori probability (APP) evaluator also relies on LLR averages which are sufficient statistics; extensions to non-binary alphabet codeword constraints and randomly failing inter-sensor links are considered there as well.
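A minimal sketch of the LLR-based sufficient statistic, assuming for concreteness BPSK transmission over per-sensor AWGN channels (an assumption made here, not stated in the original test): each sensor computes its local per-bit LLRs (11), the network averages them, e.g., via the recursions (5)-(6), and a single codeword search then matches the centralized ML decision. The toy codebook is hypothetical.

```python
import numpy as np

def local_llrs(rx, noise_var):
    """Per-bit LLRs (11) for BPSK (+1 for bit 1, -1 for bit 0) in AWGN:
    ln p(y|t=1)/p(y|t=0) = 2*y/noise_var."""
    return 2.0 * rx / noise_var

# Hypothetical toy code: 4 codewords of length 6 (known at every sensor).
codebook = np.array([[0, 0, 0, 0, 0, 0],
                     [1, 0, 1, 0, 1, 0],
                     [0, 1, 1, 0, 0, 1],
                     [1, 1, 0, 0, 1, 1]])
rng = np.random.default_rng(2)
t = codebook[2]                              # transmitted codeword
J, noise_var = 10, 2.0                       # 10 sensors, low per-sensor SNR
rx = (2.0 * t - 1.0) + np.sqrt(noise_var) * rng.standard_normal((J, len(t)))

gamma = local_llrs(rx, noise_var)            # J x 6 local LLRs
gamma_avg = gamma.mean(axis=0)               # sufficient statistic (12); attained
                                             # in-network via recursions (5)-(6)
# Centralized-equivalent ML decision: maximize the sum of LLRs over codeword bits.
scores = codebook @ gamma_avg
print("decoded index:", np.argmax(scores))   # recovers index 2 w.h.p.
```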

Figure 2: BER vs. SNR (in dB) curves depicting the local ML decoder vs. the consensus-averaging decoder vs. the ADMM-based approach vs. the centralized ML decoder benchmark.

The bit error rate (BER) versus SNR plot in Fig. 2 demonstrates the performance of ADMM-based in-network decoding of a convolutional code. This numerical test involves AWGN AP-sensor channels. Four schemes are compared: (i) the local ML decoder based on per-sensor data only (which is also used to initialize the decentralized iterations); (ii) the centralized benchmark ML decoder; (iii) the in-network decoder which forms the LLR averages using “consensus-averaging” linear iterations Consensus_Averaging ; and (iv) the ADMM-based decentralized algorithm. Indeed, the ADMM-based decoder exhibits faster convergence than its consensus-averaging counterpart; remarkably, only 10 iterations suffice to bring the decentralized BER very close to the centralized performance.

Message demodulation. In a related detection scenario, the common AP message can be mapped to a space-time matrix, with each entry drawn from a finite alphabet. The received block per sensor $j$ typically admits a linear input/output relationship $\mathbf{y}_j = \mathbf{H}_j \mathbf{t} + \mathbf{n}_j$. Matrix $\mathbf{H}_j$ is formed from the fading AP-sensor channel, and $\mathbf{n}_j$ stands for additive white Gaussian noise of unit variance, assumed uncorrelated across sensors. Since low-cost sensors have a very limited number of antennas compared to the AP, the length of $\mathbf{y}_j$ is much shorter than that of $\mathbf{t}$. Hence, the local linear demodulator using $\mathbf{y}_j$ alone may not even be able to identify $\mathbf{t}$. Again, it is critical for each sensor to cooperate with its neighbors to collectively form the global ML demodulator

$$\hat{\mathbf{t}}_{\mathrm{ML}} = \arg\min_{\mathbf{t}} \sum_{j=1}^{J} \left\| \mathbf{y}_j - \mathbf{H}_j \mathbf{t} \right\|^2, \quad \text{with the entries of } \mathbf{t} \text{ constrained to the alphabet,} \qquad (13)$$

where expanding (13) shows that the data enter only through the sample (cross-)covariance terms $\sum_{j} \mathbf{H}_j^\top \mathbf{H}_j$ and $\sum_{j} \mathbf{H}_j^\top \mathbf{y}_j$. To solve (13) locally, it suffices for each sensor to acquire the network-wide average of $\mathbf{H}_j^\top \mathbf{H}_j$, as well as that of $\mathbf{H}_j^\top \mathbf{y}_j$, as both averages constitute the minimal sufficient statistics for the centralized demodulator. Arguments similar to decentralized decoding lead to ADMM iterations that (as with BLUE) attain these average terms locally. These iterations constitute a viable decentralized demodulation method, whose performance analysis in hzacgg10twc reveals that its error diversity order can approach the centralized one within only a dozen iterations.

As demonstrated by the decoding and demodulation tasks, the cornerstone of developing a decentralized detection scheme is to extract the minimal sufficient statistics for the centralized hypothesis testing problem. This leads to significant complexity reduction in terms of communications and computational overhead.

Decentralized Support Vector Machines

The merits of support vector machines (SVMs) in a centralized setting have been well documented in various supervised classification tasks including surveillance, monitoring, and segmentation; see e.g., smola . These applications often call for decentralized supervised learning solutions, when limited training data are acquired at different locations and a central processing unit is costly or even discouraged due to, e.g., scalability, communication overhead, or privacy reasons. Noteworthy examples include WSNs for environmental or structural health monitoring, as well as diagnosis of medical conditions from patients' records distributed at different hospitals.

In this in-network classification task, a labeled training set $\mathcal{T}_j := \{(\mathbf{x}_{jn}, y_{jn})\}_{n=1}^{N_j}$ of size $N_j$ is available per node $j$, where $\mathbf{x}_{jn}$ is the input data vector and $y_{jn} \in \{-1, 1\}$ denotes its corresponding class label. Given all network-wide training data, the centralized SVM seeks a maximum-margin linear discriminant function $g(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}^* + b^*$, by solving the following convex optimization problem smola

$$\{\mathbf{w}^*, b^*\} = \arg\min_{\mathbf{w}, b, \{\xi_{jn}\}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{j=1}^{J} \sum_{n=1}^{N_j} \xi_{jn} \qquad (14)$$
$$\text{s. to } \; y_{jn}\left(\mathbf{w}^\top \mathbf{x}_{jn} + b\right) \geq 1 - \xi_{jn}, \quad \xi_{jn} \geq 0, \quad \forall j, n$$

where the slack variables $\xi_{jn}$ account for non-linearly separable training sets, and $C$ is a tunable positive scalar that allows controlling model complexity. Nonlinear discriminant functions can also be accommodated by mapping input vectors to a higher- (possibly infinite-) dimensional space using, e.g., kernel functions, and pursuing a generalized maximum-margin linear classifier as in (14). Since the SVM classifier (14) couples the local datasets, early distributed designs either rely on a centralized processor, so they are not decentralized van08dpsvm , or their performance is not guaranteed to reach that of the centralized SVM navia06dsvm .

A fresh view of decentralized SVM classification is taken in pfacgg10jmlr , which reformulates (14) to estimate the parameter pair $\{\mathbf{w}, b\}$ from all local data after eliminating the slack variables, namely

$$\{\mathbf{w}^*, b^*\} = \arg\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{j=1}^{J} \sum_{n=1}^{N_j} \max\left\{0, \; 1 - y_{jn}\left(\mathbf{w}^\top \mathbf{x}_{jn} + b\right)\right\}. \qquad (15)$$

Notice that (15) has the same decomposable structure as the general decentralized learning task in (1), upon identifying the local cost of node $j$ with its hinge-loss term plus a fraction $1/J$ of the regularizer $\frac{1}{2}\|\mathbf{w}\|^2$. Accordingly, all network nodes can solve (15) in a decentralized fashion via iterations obtained following the ADMM-based algorithmic framework of Section 2. Such a decentralized ADMM-DSVM scheme is provably convergent to the centralized SVM classifier (14), and can also incorporate nonlinear discriminant functions, as detailed in pfacgg10jmlr .
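A minimal sketch of the per-node update for this in-network SVM, where the generic recursion (6) with the hinge-loss local cost is solved numerically with scipy.optimize.minimize; the regularizer split, network, data, and all names are illustrative simplifications, not the exact ADMM-DSVM recursions of pfacgg10jmlr .

```python
import numpy as np
from scipy.optimize import minimize

def local_svm_cost(s, X, y, J, C):
    """Local cost: (1/(2J))||w||^2 + C * sum_n hinge(y_n (w^T x_n + b))."""
    w, b = s[:-1], s[-1]
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 / J * w @ w + C * hinge.sum()

def dsvm_node_update(s_j, v_j, neighbors_s, X, y, J, C, c=1.0):
    """One instance of recursion (6) for node j, solved numerically."""
    def objective(s):
        penalty = sum(np.sum((s - 0.5 * (s_j + s_i)) ** 2) for s_i in neighbors_s)
        return local_svm_cost(s, X, y, J, C) + v_j @ s + c * penalty
    return minimize(objective, s_j, method="BFGS").x

# Toy usage: two nodes with 2-D Gaussian classes, fully connected network.
rng = np.random.default_rng(3)
J, C = 2, 1.0
X = [np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))]) for _ in range(J)]
y = [np.r_[-np.ones(20), np.ones(20)] for _ in range(J)]
s = [np.zeros(3) for _ in range(J)]          # s = [w1, w2, b]
v = [np.zeros(3) for _ in range(J)]
for k in range(30):
    s_prev = [sj.copy() for sj in s]
    for j in range(J):
        nbrs = [s_prev[i] for i in range(J) if i != j]
        v[j] = v[j] + 1.0 * sum(s_prev[j] - si for si in nbrs)   # recursion (5)
        s[j] = dsvm_node_update(s_prev[j], v[j], nbrs, X[j], y[j], J, C)
print("consensus gap:", np.linalg.norm(s[0] - s[1]))
```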

Figure 3: Decision boundary comparison among ADMM-DSVM, centralized SVM, and local SVM results for synthetic data generated from two Gaussian classes, over a network of collaborating nodes.

To illustrate the performance of the ADMM-DSVM algorithm in pfacgg10jmlr , consider a randomly generated network. Each node acquires labeled training examples from two different classes, which are equiprobable and consist of random vectors drawn from a two-dimensional Gaussian distribution with common covariance matrix and class-dependent mean vectors. The Bayes optimal classifier for this 2-class problem is linear (duda, Ch. 2). To visualize this test case, Fig. 3 depicts the global training set, along with the linear discriminant functions found by the centralized SVM (14) and the ADMM-DSVM at two different nodes after 400 iterations. Local SVM results for two different nodes are also included for comparison. It is apparent that ADMM-DSVM approaches the decision rule of its centralized counterpart, whereas the local classifiers deviate since they neglect most of the training examples in the network.

Decentralized Clustering

Unsupervised learning using a network of wireless sensors as an exploratory infrastructure is well motivated for inferring hidden structures in the distributed data collected by the sensors. Different from supervised SVM-based classification tasks, each node $j$ here has available a set of unlabeled observations $\{\mathbf{x}_{jn}\}_{n=1}^{N_j}$, drawn from a total of $K$ classes. In this network setting, the goal is to design local clustering rules assigning each $\mathbf{x}_{jn}$ to a cluster $k \in \{1, \ldots, K\}$. Again, the desideratum is a decentralized algorithm capable of attaining the performance of a benchmark clustering scheme, where all observations are centrally available for joint processing.

Various criteria are available to quantify similarity among observations in a centralized setting; a popular choice is deterministic partitional clustering (DPC), which relies on prototypical elements (a.k.a. cluster centroids) per class in order to avoid comparisons between every pair of observations. Let $\mathbf{m}_k$ denote the prototype element for class $k$, and $\mu_{jn}^{(k)}$ the membership coefficient of $\mathbf{x}_{jn}$ to class $k$. A natural clustering problem amounts to specifying the family of $K$ clusters with centroids $\{\mathbf{m}_k\}$, such that the sum of squared errors is minimized; that is

$$\min_{\{\mu_{jn}^{(k)}\} \in \mathcal{M}, \; \{\mathbf{m}_k\}} \; \sum_{j=1}^{J} \sum_{n=1}^{N_j} \sum_{k=1}^{K} \big(\mu_{jn}^{(k)}\big)^{\rho} \, \left\| \mathbf{x}_{jn} - \mathbf{m}_k \right\|^2 \qquad (16)$$

where $\rho \geq 1$ is a tuning parameter, and $\mathcal{M}$ denotes the convex set of constraints on all membership coefficients. With $\rho = 1$ and the centroids $\{\mathbf{m}_k\}$ fixed, (16) becomes a linear program in the membership coefficients. Consequently, (16) admits binary optimal solutions giving rise to the so-termed hard assignments, obtained by choosing for each $\mathbf{x}_{jn}$ the cluster whose centroid is closest. Otherwise, for $\rho > 1$ the optimal coefficients generally result in soft membership assignments, and the optimal cluster is the one with maximum membership coefficient. In either case, the DPC clustering problem (16) is NP-hard, which motivates the (suboptimal) K-means algorithm that, on a per-iteration basis, proceeds in two steps to minimize the cost in (16) with respect to: (S1) the membership coefficients, with the centroids fixed; and (S2) the centroids, with the membership coefficients fixed lloyd82PCM (a sketch of this alternation follows below). Convergence of this two-step alternating-minimization scheme is guaranteed at least to a local minimum. Nonetheless, K-means requires central availability of global information (those variables that are fixed per step), which challenges in-network implementations. For this reason, most early attempts are either confined to specific communication network topologies, or they offer no closed-form local solutions; see e.g., Nowak03dem ; whk08ICML .
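For reference, here is a compact sketch of the two-step (S1)-(S2) alternation for hard K-means ($\rho = 1$), in its centralized form; the decentralized DPC algorithm of pfacgg11jstsp replaces the global centroid step with neighborhood-confined ADMM exchanges, which are not reproduced here.

```python
import numpy as np

def hard_kmeans(X, K, iters=50, seed=0):
    """Two-step alternating minimization of (16) with rho = 1 (hard assignments).
    X: (N, d) array pooling all observations; returns centroids and labels."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)]      # initial centroids
    for _ in range(iters):
        # (S1) memberships with centroids fixed: nearest-centroid assignment.
        labels = np.argmin(((X[:, None, :] - m[None]) ** 2).sum(-1), axis=1)
        # (S2) centroids with memberships fixed: per-cluster sample means.
        for k in range(K):
            if np.any(labels == k):
                m[k] = X[labels == k].mean(axis=0)
    return m, labels

# Toy usage with three well-separated 2-D clusters.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in [(-2, 0), (0, 2), (2, 0)]])
m, labels = hard_kmeans(X, K=3)
print(np.round(m, 2))   # close to the three cluster centers
```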

To address these limitations, pfacgg11jstsp casts (16) [yet another instance of (1)] as a decentralized estimation problem. It is thus possible to leverage ADMM iterations and solve (16) in a decentralized fashion through information exchanges among single-hop neighbors only. Albeit the non-convexity of (16), the decentralized DPC iterations in pfacgg11jstsp provably approach a local minimum arbitrarily closely, where the asymptotic convergence holds for hard K-means with $\rho = 1$ as well. Further extensions in pfacgg11jstsp include a decentralized expectation-maximization algorithm for probabilistic partitional clustering, and methods to handle an unknown number of classes.

Figure 4: (Left:) Average performance of hard-DKM on a real data set using a WSN, for various values of the penalty parameter. (Right:) Clustering results of hard-DKM upon convergence.

Clustering of oceanographic data. Environmental monitoring is a typical application of WSNs. In WSNs deployed for oceanographic monitoring, the cost of computation per node is lower than the cost of accessing each node's observations oceansensors . This makes the option of centralized processing less attractive, thus motivating decentralized processing. Here we test the decentralized DPC schemes of pfacgg11jstsp on real data collected by multiple underwater sensors in the Mediterranean coast of Spain WOD , with the goal of identifying regions sharing common physical characteristics. The selected feature vectors have as entries the temperature (°C) and salinity (psu) levels. The measurements were normalized to have zero mean and unit variance, and they were grouped in blocks, one per sensor. The algebraic connectivity of the WSN is 0.2289 and the average degree per node is 4.9. Fig. 4 (left) shows the performance of 25 Monte Carlo runs for the hard-DKM algorithm with different values of the penalty parameter. The best average convergence rate attained the average centralized performance after 300 iterations; tests with other parameter values are also included in Fig. 4 (left) for comparison. Note that for the smallest tested penalty, hard-DKM hovers around a point without converging, while choosing a larger penalty guarantees convergence of the algorithm to a unique solution. The clustering results of hard-DKM upon convergence are depicted in Fig. 4 (right).

4 Decentralized Adaptive Estimation

Sections 2 and 3 dealt with decentralized batch estimation, whereby network nodes acquire data only once and then locally exchange messages to reach consensus on the desired estimators. In many applications however, networks are deployed to perform estimation in a constantly changing environment, without having available a complete statistical description of the underlying processes of interest, e.g., with time-varying thermal or seismic sources. This motivates the development of decentralized adaptive estimation schemes, where nodes collect data sequentially in time and local estimates are refined recursively "on the fly." In settings where statistical state models are available, it is prudent to develop model-based tracking approaches implementing in-network Kalman or particle filters. Next, Section 2's scope is broadened to facilitate real-time (adaptive) processing of network data, when the local costs in (1) and the unknown parameters are allowed to vary with time.

4.1 Decentralized Least-Mean Squares

A decentralized least-mean squares (LMS) algorithm is developed here for adaptive estimation of (possibly) nonstationary parameters, even when statistical information such as ensemble data covariances is unknown. Suppose the network nodes are deployed to estimate a signal vector $\mathbf{s}_0$ in a collaborative fashion subject to single-hop communication constraints, by resorting to the linear LMS criterion; see e.g., sk95book ; Diffusion_LMS ; SMG_D_LMS . Per time instant $t$, each node $j$ has available a regression vector $\mathbf{h}_j(t)$ and acquires a scalar observation $x_j(t)$, both assumed zero-mean without loss of generality. Introducing the global vector $\mathbf{x}(t) := [x_1(t), \ldots, x_J(t)]^\top$ and matrix $\mathbf{H}(t) := [\mathbf{h}_1(t), \ldots, \mathbf{h}_J(t)]^\top$, the global time-dependent LMS estimator of interest can be written as sk95book ; Diffusion_LMS ; SMG_D_LMS

$$\hat{\mathbf{s}}(t) := \arg\min_{\mathbf{s}} \; \mathbb{E}\left[ \left\| \mathbf{x}(t) - \mathbf{H}(t)\,\mathbf{s} \right\|^2 \right] = \arg\min_{\mathbf{s}} \sum_{j=1}^{J} \mathbb{E}\left[ \left( x_j(t) - \mathbf{h}_j^\top(t)\,\mathbf{s} \right)^2 \right]. \qquad (17)$$

For jointly wide-sense stationary $\{\mathbf{x}(t), \mathbf{H}(t)\}$, solving (17) leads to the well-known Wiener filter estimate $\hat{\mathbf{s}}_W = \mathbf{R}_{\mathbf{H}}^{-1} \mathbf{r}_{\mathbf{H}x}$, where $\mathbf{R}_{\mathbf{H}} := \mathbb{E}[\mathbf{H}^\top(t)\mathbf{H}(t)]$ and $\mathbf{r}_{\mathbf{H}x} := \mathbb{E}[\mathbf{H}^\top(t)\mathbf{x}(t)]$; see e.g., (sk95book, p. 15).

For the cases where the auto- and cross-covariance terms $\mathbf{R}_{\mathbf{H}}$ and $\mathbf{r}_{\mathbf{H}x}$ are unknown, the approach followed here to develop the decentralized (D-) LMS algorithm includes two main building blocks: (i) recast (17) into an equivalent form amenable to in-network processing via the ADMM framework of Section 2; and (ii) leverage stochastic approximation iterations kushner to obtain an adaptive LMS-like algorithm that can handle the unavailability or variation of statistical information. Following the algorithmic construction steps outlined in Section 2, the following updating recursions are obtained for the multipliers and the local estimates per time instant $t$:

$$\mathbf{v}_j(t) = \mathbf{v}_j(t-1) + c \sum_{j' \in \mathcal{N}_j} \left[ \mathbf{s}_j(t) - \mathbf{s}_{j'}(t) \right] \qquad (18)$$
$$\mathbf{s}_j(t+1) = \arg\min_{\mathbf{s}_j} \left[ \mathbb{E}\left[ \left( x_j(t+1) - \mathbf{h}_j^\top(t+1)\,\mathbf{s}_j \right)^2 \right] + \mathbf{v}_j^\top(t)\,\mathbf{s}_j + c \sum_{j' \in \mathcal{N}_j} \left\| \mathbf{s}_j - \frac{\mathbf{s}_j(t) + \mathbf{s}_{j'}(t)}{2} \right\|^2 \right] \qquad (19)$$

It is apparent that, after differentiating (19) and setting the gradient equal to zero, $\mathbf{s}_j(t+1)$ can be obtained as the root of an equation of the form

$$\mathbb{E}\left[ \boldsymbol{\varphi}_j\left( \mathbf{s}_j; t+1 \right) \right] = \mathbf{0} \qquad (20)$$

where $\boldsymbol{\varphi}_j(\mathbf{s}_j; t+1)$ corresponds to the stochastic gradient of the cost in (19). However, this equation cannot be solved since the nodes do not have available any statistical information about the acquired data. Inspired by stochastic approximation techniques (such as the celebrated Robbins-Monro algorithm; see e.g., (kushner, Ch. 1)), which iteratively find the root of (20) given noisy observations of the gradient, one can simply drop the unknown expected value to obtain the following D-LMS (i.e., stochastic gradient) updates

$$\mathbf{s}_j(t+1) = \mathbf{s}_j(t) + \mu \left[ 2\, \mathbf{h}_j(t+1)\, e_j(t+1) - \mathbf{v}_j(t) - c \sum_{j' \in \mathcal{N}_j} \left( \mathbf{s}_j(t) - \mathbf{s}_{j'}(t) \right) \right] \qquad (21)$$

where $\mu$ denotes a constant step-size, and $e_j(t+1) := x_j(t+1) - \mathbf{h}_j^\top(t+1)\,\mathbf{s}_j(t)$ is the local a priori error, which enters (21) scaled by two.

Recursions (18) and (21) constitute the D-LMS algorithm, which can be viewed as a stochastic-gradient counterpart of D-BLUE in Section 3.1. D-LMS is a pioneering approach to decentralized online learning, which blends for the first time affordable (first-order) stochastic approximation steps with parallel ADMM iterations. The use of a constant step-size endows D-LMS with tracking capabilities. This is desirable in a constantly changing environment, within which, e.g., WSNs are envisioned to operate. The D-LMS algorithm is stable and convergent even in the presence of inter-node communication noise (see details in SMG_D_LMS ; MSG_D_LMS ). Further, closed-form expressions for the MSE evolution and the steady-state mean-square error (MSE), as well as selection guidelines for the step-size, can be found in MSG_D_LMS .
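A minimal simulation sketch of the D-LMS recursions (18) and (21), under the a priori error and step-size conventions stated above; the network topology, signal model, and parameter values are illustrative.

```python
import numpy as np

def d_lms(h_stream, x_stream, neighbors, p, mu=0.05, c=1.0, T=2000):
    """D-LMS: per time t, each node runs (18) and the stochastic-gradient step (21).
    h_stream(t, j) -> regressor h_j(t); x_stream(t, j) -> scalar observation x_j(t)."""
    J = len(neighbors)
    s = np.zeros((J, p))
    v = np.zeros((J, p))
    for t in range(T):
        s_prev = s.copy()
        for j in range(J):
            # (18): multiplier update from current local/neighbor estimates.
            diff = sum(s_prev[j] - s_prev[i] for i in neighbors[j])
            v[j] += c * diff
            # (21): stochastic-gradient (LMS-like) estimate update.
            h, x = h_stream(t, j), x_stream(t, j)
            e = x - h @ s_prev[j]                    # local a priori error
            s[j] = s_prev[j] + mu * (2 * h * e - v[j] - c * diff)
    return s

# Toy usage: J = 5 nodes on a ring estimating a static s0 (p = 4).
rng = np.random.default_rng(5)
p, J = 4, 5
s0 = rng.standard_normal(p)
H = rng.standard_normal((3000, J, p))
X = np.einsum("tjp,p->tj", H, s0) + 0.1 * rng.standard_normal((3000, J))
neighbors = [[(j - 1) % J, (j + 1) % J] for j in range(J)]
s = d_lms(lambda t, j: H[t, j], lambda t, j: X[t, j], neighbors, p)
print(np.linalg.norm(s - s0, axis=1))    # per-node error should be small
```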

Figure 5: Tracking with D-LMS. (left) Local MSE performance metrics both with and without inter-node communication noise for sensors 3 and 12; and (right) True and estimated time-varying parameters for a representative node, using slow and optimal adaptation levels.

Here we test the tracking performance of D-LMS via computer simulation. For a random geometric graph, network-wide observations are linearly related to a large-amplitude, slowly time-varying parameter vector, generated according to a stationary autoregressive model with normally distributed driving noise. To model noisy links, additive white Gaussian noise is present at the receiving end. Fig. 5 (left) depicts the local performance of two representative nodes through the evolution of the excess mean-square error and mean-square deviation figures of merit. Both noisy and ideal links are considered, and the empirical curves closely follow the theoretical trajectories derived in MSG_D_LMS . Steady-state limiting values are also extremely accurate. As intuitively expected and suggested by the analysis, a performance penalty due to non-ideal links is also apparent. Fig. 5 (right) illustrates how the adaptation level affects the resulting per-node estimates when tracking time-varying parameters with D-LMS. For a small step-size (slow adaptation) and a near-optimal one, we depict the third entry of the parameter vector and the respective estimates from the randomly chosen sixth node. Under optimal adaptation the local estimate closely tracks the true variations, while, as expected, for the smaller step-size D-LMS fails to provide an accurate estimate MSG_D_LMS ; sk95book .

4.2 Decentralized Recursive Least-Squares

The recursive least-squares (RLS) algorithm has well-appreciated merits for reducing complexity and storage requirements in online estimation of stationary signals, as well as for tracking slowly varying nonstationary processes sk95book ; Estimation_Theory . RLS is especially attractive when the state and/or data model are not available (as with LMS), and fast convergence rates are at a premium. Compared to the LMS scheme, RLS typically offers faster convergence and improved estimation performance, at the cost of higher computational complexity. To enable these valuable tradeoffs in the context of in-network processing, the ADMM framework of Section 2 is utilized here to derive a decentralized (D-) RLS adaptive scheme that can be employed for distributed localization and power spectrum estimation (see also MSG_D_RLS ; MG_D_RLS for further details on the algorithmic construction and convergence claims).

Consider the data setting and linear regression task in Section 4.1. The RLS estimator for the unknown parameter $\mathbf{s}_0$ minimizes the exponentially weighted least-squares (EWLS) cost; see e.g., sk95book ; Estimation_Theory

$$\hat{\mathbf{s}}_{\mathrm{ewls}}(t) := \arg\min_{\mathbf{s}} \sum_{\tau=0}^{t} \lambda^{t-\tau} \sum_{j=1}^{J} \left[ x_j(\tau) - \mathbf{h}_j^\top(\tau)\,\mathbf{s} \right]^2 + \lambda^{t+1}\, \mathbf{s}^\top \boldsymbol{\Phi}_0\, \mathbf{s} \qquad (22)$$

where $\lambda \in (0, 1]$ is a forgetting factor, while the positive definite matrix $\boldsymbol{\Phi}_0$ is included for regularization. Note that in forming the EWLS estimator at time $t$, the entire history of data for $\tau = 0, 1, \ldots, t$ is incorporated in the online estimation process. Whenever $\lambda < 1$, past data are exponentially discarded, thus enabling tracking of nonstationary processes.
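For intuition on the local building block, a standard single-node exponentially weighted RLS recursion minimizing a cost of the form (22) is sketched below; the D-RLS recursions (23)-(26) augment this with the multiplier and neighbor terms, which are omitted here. Names and parameter values are illustrative.

```python
import numpy as np

class EWRLS:
    """Single-node exponentially weighted RLS for x(t) = h(t)^T s0 + noise.
    Recursively minimizes sum_tau lambda^(t-tau) [x(tau) - h(tau)^T s]^2."""
    def __init__(self, p, lam=0.98, delta=1e2):
        self.lam = lam
        self.P = delta * np.eye(p)     # inverse data covariance; delta large
        self.s = np.zeros(p)

    def update(self, h, x):
        Ph = self.P @ h
        g = Ph / (self.lam + h @ Ph)               # gain vector
        self.s = self.s + g * (x - h @ self.s)     # a priori error correction
        self.P = (self.P - np.outer(g, Ph)) / self.lam
        return self.s

# Toy usage: track a slowly drifting parameter vector.
rng = np.random.default_rng(6)
p = 3
s0 = rng.standard_normal(p)
rls = EWRLS(p)
for t in range(500):
    s0 += 0.001 * rng.standard_normal(p)           # slow drift
    h = rng.standard_normal(p)
    x = h @ s0 + 0.05 * rng.standard_normal()
    s_hat = rls.update(h, x)
print(np.linalg.norm(s_hat - s0))                  # small tracking error
```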

Again, to decompose the cost function in (22), whose summands are coupled through the global variable $\mathbf{s}$, we introduce auxiliary variables $\{\mathbf{s}_j\}$ that represent local estimates per node $j$. These local estimates are utilized to form a convex constrained and separable minimization problem as in (2), which can be solved using ADMM to yield the following decentralized iterations (details in MSG_D_RLS ; MG_D_RLS )

$$\mathbf{v}_j(t) = \mathbf{v}_j(t-1) + c \sum_{j' \in \mathcal{N}_j} \left[ \mathbf{s}_j(t) - \mathbf{s}_{j'}(t) \right] \qquad (23)$$
$$\mathbf{s}_j(t+1) = \left[ \boldsymbol{\Phi}_j(t+1) + 2c|\mathcal{N}_j|\,\mathbf{I} \right]^{-1} \left[ \boldsymbol{\psi}_j(t+1) - \mathbf{v}_j(t) + c \sum_{j' \in \mathcal{N}_j} \left( \mathbf{s}_j(t) + \mathbf{s}_{j'}(t) \right) \right] \qquad (24)$$

where the local data aggregates $\boldsymbol{\Phi}_j$ and $\boldsymbol{\psi}_j$ are updated recursively as

$$\boldsymbol{\Phi}_j(t+1) = \lambda\, \boldsymbol{\Phi}_j(t) + \mathbf{h}_j(t+1)\, \mathbf{h}_j^\top(t+1) \qquad (25)$$
$$\boldsymbol{\psi}_j(t+1) = \lambda\, \boldsymbol{\psi}_j(t) + \mathbf{h}_j(t+1)\, x_j(t+1). \qquad (26)$$

The D-RLS recursions (23) and (24) involve similar inter-node communication exchanges as in D-LMS. It is recommended to initialize the matrix recursion (25) with a scaled identity whose scaling is chosen sufficiently large sk95book . The local estimates in D-RLS converge in the mean sense to the true parameter (in the time-invariant case), even when information exchanges are imperfect. Closed-form expressions for the bounded estimation MSE, along with numerical tests and comparisons with the incremental RLS inc_RLS and diffusion RLS Diffusion_RLS algorithms, can be found in MG_D_RLS .

Decentralized spectrum sensing using WSNs. A WSN application where the need for linear regression arises is spectrum estimation for the purpose of environmental monitoring. Suppose sensors comprising a WSN deployed over some area of interest observe a narrowband source to determine its spectral peaks. These peaks can reveal hidden periodicities due to, e.g., a natural heat or seismic source. The source of interest propagates through multipath channels and is contaminated with additive noise present at the sensors. The unknown source-sensor channels may introduce deep fades at the frequency band occupied by the source. Thus, having each sensor operate on its own may lead to faulty assessments. The spatial diversity available to effect improved spectral estimates can only be achieved via sensor collaboration, as in the decentralized estimation algorithms presented in this chapter.

Let $\theta(t)$ denote the evolution of the source signal in time, and suppose that $\theta(t)$ can be modeled as an autoregressive (AR) process (Stoica_Book, p. 106)

$$\theta(t) = \sum_{\tau=1}^{\rho} \alpha_\tau\, \theta(t - \tau) + w(t)$$

where $\rho$ is the order of the AR process, $\{\alpha_\tau\}$ are the AR coefficients, and $w(t)$ denotes driving white noise. The source propagates to sensor $j$ via a channel modeled as an FIR filter of unknown order and tap coefficients, and is contaminated with additive sensing noise to yield the observation $x_j(t)$.

Since $x_j(t)$ is an autoregressive moving average (ARMA) process, it follows that Stoica_Book

$$x_j(t) = \sum_{\tau=1}^{\rho} \alpha_\tau\, x_j(t - \tau) + \eta_j(t) \qquad (27)$$

where $\eta_j(t)$ collects the moving-average (MA) component, whose coefficients and white-noise variance depend on the source-sensor channel as well as the variances of the driving and sensing noise. For the purpose of determining spectral peaks, the MA term in (27) can be treated as observation noise. This is very important, since this way sensors need to know neither the source-sensor channel coefficients nor the noise variances. Accordingly, the spectral content of the source can be estimated provided sensors estimate the AR coefficients. To this end, let $\mathbf{s}_0 := [\alpha_1, \ldots, \alpha_\rho]^\top$ be the unknown parameter of interest. From (27), the regression vectors are given as $\mathbf{h}_j(t) = [x_j(t-1), \ldots, x_j(t-\rho)]^\top$, and can be acquired directly from the sensor measurements without the need for training/estimation.
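A brief illustration of this regression setup, assuming a known AR order (all numerical values are illustrative): each sensor builds regressors from its own lagged observations, any LS/LMS/RLS scheme from this section estimates the AR coefficients, and the PSD peak follows from the AR transfer function.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = np.array([1.5, -0.9])          # AR(2) with a sharp spectral peak
T, rho = 5000, 2

# Simulate the observation; the MA part of (27) is folded into the noise here.
x = np.zeros(T)
for t in range(rho, T):
    x[t] = alpha @ x[t - rho:t][::-1] + 0.3 * rng.standard_normal()

# Regressors h_j(t) = [x(t-1), ..., x(t-rho)]; least-squares AR coefficient fit.
H = np.column_stack([x[rho - 1 - d:T - 1 - d] for d in range(rho)])
a_hat = np.linalg.lstsq(H, x[rho:], rcond=None)[0]

# AR power spectral density (up to the driving-noise variance).
f = np.linspace(0, 0.5, 512)
denom = np.abs(1 - sum(a_hat[d] * np.exp(-2j * np.pi * f * (d + 1))
                       for d in range(rho))) ** 2
print("estimated coefficients:", np.round(a_hat, 3))
print("spectral peak near f =", f[np.argmin(denom)])
```

With the chosen coefficients the AR poles sit near frequency 0.1, so the estimated peak location should land close to that value.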

Figure 6: D-LMS in a power spectrum estimation task. (left) The true narrowband spectrum is compared to the estimated PSD, obtained after the WSN runs the D-LMS and the (non-cooperative) L-LMS algorithms. The reconstruction results correspond to a sensor whose multipath channel from the source introduces a null at the source's peak frequency. (right) Global MSE evolution (network learning curve) for the D-LMS and D-RLS algorithms.

The performance of the decentralized adaptive algorithms described so far is illustrated next, when applied to the aforementioned power spectrum estimation task. For the numerical experiments, an ad hoc WSN is simulated as a realization of a random geometric graph. The source-sensor channels corresponding to a few of the sensors are set so that they have a null at the frequency where the AR source has a peak. Fig. 6 (left) depicts the actual power spectral density (PSD) of the source, as well as the estimated PSDs for one of the sensors affected by a bad channel. To form the desired estimates in a distributed fashion, the WSN runs the local (L-) LMS and the D-LMS algorithms outlined in Section 4.1. L-LMS is a non-cooperative scheme whereby each sensor independently runs an LMS adaptive filter fed by its local data only. The experiment involving D-LMS is performed under both ideal and noisy inter-sensor links. Clearly, even in the presence of communication noise, D-LMS exploits the available spatial diversity and allows all sensors to estimate accurately the actual spectral peak, whereas L-LMS leads the problematic sensors to misleading estimates.

For the same setup, Fig. 6 (right) shows the evolution of the global learning curve. The D-LMS and D-RLS algorithms are compared under ideal communication links. It is apparent that D-RLS achieves improved performance in terms of both convergence rate and steady-state MSE. As discussed in Section 4.2, this comes at the price of increased computational complexity per sensor, while the communication costs incurred are identical.

4.3 Decentralized Model-based Tracking

The decentralized adaptive schemes in Sections 4.1 and 4.2 are suitable for tracking slowly time-varying signals in settings where no statistical models are available. In certain cases, such as target tracking, state evolution models can be derived and employed by exploiting the physics of the problem. The availability of such models paves the way for improved state tracking via Kalman filtering/smoothing techniques; see e.g., Opt_Filtering_Moore ; Estimation_Theory . Model-based decentralized Kalman filtering/smoothing, as well as particle filtering schemes for multi-node networks, are briefly outlined here.

Initial attempts to distribute the centralized Kalman filter (KF) recursions (see Olfati_Kalman and references in sgrr08tsp ) rely on consensus averaging Consensus_Averaging . The idea is to estimate across nodes those sufficient statistics (expressible in terms of network-wide averages) required to form the corrected state and the corresponding corrected state error covariance matrix. Clearly, there is an inherent delay in obtaining these estimates, confining the operation of such schemes to applications with slowly varying state vectors and/or fast communications, as needed to complete multiple consensus iterations within the time interval separating the acquisition of consecutive measurements. Other issues that may lead to instability in existing decentralized KF approaches are detailed in sgrr08tsp .

Instead of filtering, the delay incurred by those inner-loop consensus iterations motivated the consideration of fixed-lag decentralized Kalman smoothing (KS) in sgrr08tsp . Matching consensus iterations with the time instants of data acquisition, fixed-lag smoothers allow sensors to form local MMSE-optimal smoothed estimates, which take advantage of all measurements acquired within the “waiting period.” The ADMM-enabled decentralized KS in sgrr08tsp also overcomes the noise-related limitations of consensus-averaging algorithms xbk07jpdc . In the presence of communication noise, these estimates converge in the mean sense, while their noise-induced variance remains bounded. This noise resiliency allows sensors to exchange quantized data, further lowering communication cost. For a tutorial treatment of decentralized Kalman filtering approaches using WSNs (including the decentralized ADMM-based KS of sgrr08tsp and strategies to reduce the communication cost of state estimation problems), the interested reader is referred to dkf_control_mag . These reduced-cost strategies exploit the redundancy in the information provided by individual observations collected at different sensors, different observations collected at different sensors, and different observations acquired at the same sensor.

On a related note, a collaborative algorithm is developed in cg_cartography to estimate the channel gains of wireless links in a geographical area. Kriged Kalman filtering (KKF) ripley , a tool with widely appreciated merits in spatial statistics and geosciences, is adopted and implemented in a decentralized fashion leveraging the ADMM framework described here. The distributed KKF algorithm requires only local message passing to track the time-varying, so-termed “shadowing field” using a network of radiometers, yet it provides a global view of the radio frequency (RF) environment through consensus iterations; see also Section 5.3 for further elaboration on spectrum sensing carried out via wireless cognitive radio networks.

To wrap up the discussion, consider a network of collaborating agents (e.g., robots) equipped with wireless sensors measuring distance and/or bearing from a target that they wish to track. Even if state models are available, the nonlinearities present in these measurements prevent sensors from employing the clairvoyant (linear) Kalman tracker discussed so far. In response to these challenges, dpf develops a set-membership constrained particle filter (PF) approach that: (i) exhibits performance comparable to the centralized PF; (ii) requires only communication of particle weights among neighboring sensors; and (iii) can afford both consensus-based and incremental averaging implementations. Affordable inter-sensor communications are enabled through a novel distributed adaptation scheme, which considerably reduces the number of particles needed to achieve a given performance. The interested reader is referred to dpf_tutorial for a recent tutorial account of decentralized PF in multi-agent networks.

5 Decentralized Sparsity-regularized Rank Minimization

Modern network data sets typically involve a large number of attributes. This fact motivates predictive models offering a sparse (broadly meaning parsimonious) representation in terms of a few attributes. Such low-dimensional models facilitate interpretability and enhanced predictive performance. In this context, this section deals with ADMM-based decentralized algorithms for sparsity-regularized rank minimization. It is argued that such algorithms are key to unveiling Internet traffic anomalies from ubiquitous link-load measurements. Moreover, the notion of RF cartography is subsequently introduced to exemplify the development of a paradigm infrastructure for situational awareness at the physical layer of wireless cognitive radio (CR) networks. A (subsumed) decentralized sparse linear regression algorithm is outlined to accomplish the aforementioned cartography task.

5.1 Network Anomaly Detection Via Sparsity and Low Rank

Consider a backbone IP network, whose abstraction is a graph with $N$ nodes (routers) and $L$ physical links. The operational goal of the network is to transport a set of $F$ origin-destination (OD) traffic flows associated with specific OD (ingress-egress router) pairs. Let $y_{l,t}$ denote the traffic volume (in bytes or packets) passing through link $l \in \{1, \ldots, L\}$ over a fixed time interval $t$. Link counts across the entire network are collected in the vector $\mathbf{y}_t$, e.g., using the ubiquitous SNMP protocol. Single-path routing is adopted here, meaning a given flow's traffic is carried through multiple links connecting the corresponding source-destination pair along a single path. Accordingly, over a discrete time horizon the measured link counts $\mathbf{Y} := [y_{l,t}]$ and the (unobservable) OD flow traffic matrix $\mathbf{X} := [x_{f,t}]$ are related through $\mathbf{Y} = \mathbf{R}\mathbf{X}$ lakhina , where the so-termed routing matrix $\mathbf{R} \in \{0, 1\}^{L \times F}$ is such that $r_{l,f} = 1$ if link $l$ carries flow $f$, and zero otherwise. The routing matrix is ‘wide,’ as for backbone networks the number of OD flows is much larger than the number of physical links. A cardinal property of the traffic matrix is noteworthy: common temporal patterns across OD traffic flows, in addition to their almost periodic behavior, render most rows (respectively columns) of $\mathbf{X}$ linearly dependent, and thus $\mathbf{X}$ typically has low rank. This intuitive property has been extensively validated with real network data; see Fig. 7 and e.g., lakhina .
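The following sketch generates synthetic link loads from this model, combining a low-rank traffic matrix with sparse anomalies [cf. the anomaly-aware link-level model introduced below]; the dimensions, random routing, and anomaly rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
L, F, T, r = 20, 60, 100, 3          # links, OD flows, time slots, traffic rank

# Random 0/1 routing matrix: each OD flow traverses a few links (single path).
R = (rng.random((L, F)) < 0.15).astype(float)

# Low-rank nominal traffic X (common temporal patterns across flows).
X = rng.random((F, r)) @ np.abs(rng.standard_normal((r, T)))

# Sparse anomaly matrix A: rare, large traffic spikes in a few (flow, time) slots.
A = np.zeros((F, T))
idx = rng.random((F, T)) < 0.005
A[idx] = 10 * X.mean()

# Measured link counts: Y = R (X + A) + noise.
Y = R @ (X + A) + 0.01 * rng.standard_normal((L, T))
print("rank of X:", np.linalg.matrix_rank(X), "| anomalies:", idx.sum())
```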

Figure 7: Volumes of representative OD flows, taken from the operation of Internet-2 during a seven-day period. Temporal periodicities and correlations across flows are apparent. As expected, in this case the traffic matrix can be well approximated by a low-rank matrix, since its normalized singular values decay rapidly to zero.

It is not uncommon for some of the OD flow rates to experience unexpected abrupt changes. These so-termed traffic volume anomalies are typically due to (unintentional) network equipment misconfiguration or outright failure, unforeseen behaviors following routing policy modifications, or cyber attacks (e.g., DoS attacks) which aim at compromising the services offered by the network zggr05 ; lakhina ; mrspmag13 . Let $a_{f,t}$ denote the unknown amount of anomalous traffic in flow $f$ at time $t$, which one wishes to estimate. Explicitly accounting for the presence of anomalous flows, the measured traffic carried by link $l$ is then given by $y_{l,t} = \sum_{f} r_{l,f}\,(x_{f,t} + a_{f,t}) + v_{l,t}$, where the noise variables $v_{l,t}$ capture measurement errors and unmodeled dynamics. Traffic volume anomalies are (unsigned) sudden changes in the traffic of OD flows, and as such their effect can span multiple links in the network. A key difficulty in unveiling anomalies from link-level measurements only is that, oftentimes, clearly discernible anomalous spikes in the flow traffic can be masked through “destructive interference” of the superimposed OD flows