I. Introduction
Let $\mathbf{X} \in \mathbb{R}^{L \times T}$ be a low-rank matrix ($\mathrm{rank}(\mathbf{X}) \ll \min(L,T)$), and let $\mathbf{A} \in \mathbb{R}^{F \times T}$ be a sparse matrix with support size considerably smaller than $FT$. Consider also a matrix $\mathbf{R} \in \mathbb{R}^{L \times F}$ and a set $\Omega \subseteq \{1,\ldots,L\} \times \{1,\ldots,T\}$ of index pairs $(l,t)$ that define a sampling of the entries of the data. Given $\mathbf{R}$ and a number of (possibly) noise-corrupted measurements

(1) $y_{l,t} = x_{l,t} + \mathbf{r}_l' \mathbf{a}_t + v_{l,t}, \quad (l,t) \in \Omega$

where $\mathbf{r}_l'$ denotes the $l$-th row of $\mathbf{R}$, $\mathbf{a}_t$ the $t$-th column of $\mathbf{A}$, and $v_{l,t}$ the measurement noise, the goal is to estimate the low-rank matrix $\mathbf{X}$ and the sparse matrix $\mathbf{A}$, by denoising the observed entries and imputing the missing ones. Introducing the sampling operator $\mathcal{P}_\Omega(\cdot)$, which sets the entries of its matrix argument not in $\Omega$ to zero and leaves the rest unchanged, the data model can be compactly written in matrix form as

(2) $\mathcal{P}_\Omega(\mathbf{Y}) = \mathcal{P}_\Omega(\mathbf{X} + \mathbf{R}\mathbf{A} + \mathbf{V}).$
A natural estimator accounting for the low rank of $\mathbf{X}$ and the sparsity of $\mathbf{A}$ will be sought to fit the data in the least-squares (LS) error sense, as well as minimize the rank of $\mathbf{X}$ and the number of nonzero entries of $\mathbf{A}$ measured by its $\ell_0$ (pseudo) norm; see e.g., [CP10], [MMG12], [CLMW09], [CSPW11] for related problems subsumed by the one described here. Unfortunately, both rank and $\ell_0$-norm minimization are in general NP-hard problems [Natarajan_NP_duro, rank_NP_Duro]. Typically, the nuclear norm $\|\mathbf{X}\|_* := \sum_i \sigma_i(\mathbf{X})$ ($\sigma_i(\mathbf{X})$ denotes the $i$-th singular value of $\mathbf{X}$) and the $\ell_1$ norm $\|\mathbf{A}\|_1$ are adopted as surrogates, since they are the closest convex approximants to $\mathrm{rank}(\mathbf{X})$ and $\|\mathbf{A}\|_0$, respectively [RFP07, CT05, tropp06tit]. Accordingly, one solves

(P1) $\min_{\{\mathbf{X},\mathbf{A}\}} \; \tfrac{1}{2} \|\mathcal{P}_\Omega(\mathbf{Y} - \mathbf{X} - \mathbf{R}\mathbf{A})\|_F^2 + \lambda_* \|\mathbf{X}\|_* + \lambda_1 \|\mathbf{A}\|_1$
where $\lambda_*, \lambda_1 \geq 0$ are rank- and sparsity-controlling parameters. Being convex, (P1) is appealing, and some of its special instances are known to attain good performance in theory and practice. For instance, when no data are missing, (P1) can be used to unveil traffic anomalies in networks [MMG12]. Results in [MMG12] show that $\mathbf{X}$ and $\mathbf{A}$ can be exactly recovered in the absence of noise, even when $\mathbf{R}$ is a fat (compression) operator. When $\mathbf{R}$ equals the identity matrix, (P1) reduces to the so-termed robust principal component analysis (PCA), for which exact recovery results are available in [CLMW09] and [CSPW11]. Moreover, for the special case $\mathbf{A} = \mathbf{0}$, (P1) offers a low-rank matrix completion alternative with well-documented merits; see e.g., [CR08] and [CP10]. Stable recovery results in the presence of noise are also available for matrix completion and robust PCA [CP10, zlwcm10]. Earlier efforts dealing with the recovery of sparse vectors in noise led to similar performance guarantees; see e.g.,
[bickel09]. In all these works, the samples and matrix $\mathbf{R}$ are assumed centrally available, so that they can be jointly processed to estimate $\mathbf{X}$ and $\mathbf{A}$ by e.g., solving (P1). Collecting all this information can be challenging though in various applications of interest, and may even be impossible in e.g., wireless sensor networks (WSNs) operating under stringent power budget constraints. In other cases such as the Internet or collaborative marketing studies, agents providing private data for e.g., fitting a low-rank preference model, may not be willing to share their training data but only the learning results. Performing the optimization in a centralized fashion raises robustness concerns as well, since the central processor represents a single point of failure. Scalability is yet another driving force motivating distributed solutions. Several customized iterative algorithms have been proposed to solve instances of (P1), and have been shown effective in tackling low- to medium-size problems; see e.g., [MMG12], [CR08], [RFP07]
. However, most algorithms require computation of singular values per iteration and become prohibitively expensive when dealing with high-dimensional data [Recht_Parallel_2011]. All in all, the aforementioned reasons motivate the reduced-complexity distributed algorithm for nuclear- and $\ell_1$-norm minimization developed in this paper. In a similar vein, stochastic gradient algorithms were recently developed for large-scale problems entailing regularization with the nuclear norm [Recht_Parallel_2011]. Even though the iterations in [Recht_Parallel_2011] are highly parallelizable, they are not applicable to networks of arbitrary topology. There are also several studies on distributed estimation of sparse signals via $\ell_1$-norm regularized regression; see e.g., [mateos_dlasso, dusan11, chouvardas11]. Different from the treatment here, the data model of [mateos_dlasso] is devoid of a low-rank component and all the observations are assumed available (but distributed across several interconnected agents). Formally, the model therein is a special case of (2) with $\mathbf{X} = \mathbf{0}$, $T = 1$, and $\Omega$ comprising all entries, in which case (P1) boils down to finding the least-absolute shrinkage and selection operator (Lasso) [tibshirani_lasso].
Building on the general model (2) and the centralized estimator (P1), this paper develops decentralized algorithms to estimate low-rank and sparse matrices, based on in-network processing of a small subset of noise-corrupted and spatially-distributed measurements (Section III). This is a challenging task, however, since the non-separable nuclear norm present in (P1) is not amenable to distributed minimization. To overcome this limitation, an alternative characterization of the nuclear norm is adopted in Section III-A, which leads to a separable yet nonconvex cost that is minimized via the alternating-direction method of multipliers (ADMoM) [Bertsekas_Book_Distr]. The novel distributed iterations entail reduced-complexity optimization subtasks per agent, and affordable message passing only between single-hop neighbors (Section III-C). Interestingly, the distributed (nonconvex) estimator provably attains, upon convergence, the global optimum of its centralized counterpart (P1), regardless of initialization. To demonstrate the generality of the proposed estimator and its algorithmic framework, four networking-related application domains are outlined in Section IV, namely: i) unveiling traffic volume anomalies in large-scale networks [LCD04, MMG12]; ii) robust PCA [CSPW11, CLMW09]; iii) low-rank matrix completion for network-wide path latency prediction [latencyprediction]; and iv) spectrum sensing for cognitive radio (CR) networks [mateos_dlasso, juan_sensig]. Numerical tests with synthetic and real network data drawn from these application domains corroborate the effectiveness and convergence of the novel distributed algorithms, as well as their centralized performance benchmarks (Section V).
Section VI concludes the paper, while several technical details are deferred to the Appendix.
Notation: Bold uppercase (lowercase) letters will denote matrices (column vectors), and calligraphic letters will be used for sets. Operators $(\cdot)'$, $\mathrm{tr}(\cdot)$, $\sigma_{\max}(\cdot)$, $\odot$, and $\otimes$ will denote transposition, matrix trace, maximum singular value, Hadamard product, and Kronecker product, respectively; $|\cdot|$ will be used for the cardinality of a set, and the magnitude of a scalar. The matrix vectorization operator $\mathrm{vec}(\mathbf{M})$ stacks the columns of matrix $\mathbf{M}$ on top of each other to return a supervector, and its inverse is $\mathrm{unvec}(\cdot)$. The diagonal matrix $\mathrm{diag}(\mathbf{x})$ has the entries of $\mathbf{x}$ on its diagonal, and a positive semidefinite matrix $\mathbf{M}$ will be denoted by $\mathbf{M} \succeq \mathbf{0}$. The $\ell_p$ norm of $\mathbf{x} \in \mathbb{R}^n$ is $\|\mathbf{x}\|_p := (\sum_{i=1}^n |x_i|^p)^{1/p}$ for $p \geq 1$. For two matrices $\mathbf{M}, \mathbf{U} \in \mathbb{R}^{n \times p}$, $\langle \mathbf{M}, \mathbf{U} \rangle := \mathrm{tr}(\mathbf{M}'\mathbf{U})$ denotes their trace inner product. The Frobenius norm of matrix $\mathbf{M} = [m_{i,j}]$ is $\|\mathbf{M}\|_F := \sqrt{\mathrm{tr}(\mathbf{M}\mathbf{M}')}$, $\|\mathbf{M}\|$ is the spectral norm, $\|\mathbf{M}\|_1 := \sum_{i,j} |m_{i,j}|$ is the $\ell_1$ norm, $\|\mathbf{M}\|_\infty := \max_{i,j} |m_{i,j}|$ is the $\ell_\infty$ norm, and $\|\mathbf{M}\|_* := \sum_i \sigma_i(\mathbf{M})$ is the nuclear norm, where $\sigma_i(\mathbf{M})$ denotes the $i$-th singular value of $\mathbf{M}$. The $n \times n$ identity matrix will be represented by $\mathbf{I}_n$, while $\mathbf{0}_n$ will stand for the $n \times 1$ vector of all zeros, and $\mathbf{0}_{n \times p} := \mathbf{0}_n \mathbf{0}_p'$. Similar notation will be adopted for vectors (matrices) of all ones.
II. Preliminaries and Problem Statement
Consider a network of $N$ agents capable of performing some local computations, as well as exchanging messages among directly connected neighbors. An agent should be understood as an abstract entity, e.g., a sensor in a WSN, a router monitoring Internet traffic, or a sensing CR in a next-generation communication system. The network is modeled as an undirected graph $\mathcal{G}(\mathcal{N}, \mathcal{E})$, where the set of nodes $\mathcal{N} := \{1, \ldots, N\}$ corresponds to the network agents, and the edges (links) in $\mathcal{E}$ represent pairs of agents that can communicate. Agent $n \in \mathcal{N}$ communicates with its single-hop neighboring peers in $\mathcal{N}_n$, and the size of the neighborhood will be henceforth denoted by $|\mathcal{N}_n|$. To ensure that the data from an arbitrary agent can eventually percolate through the entire network, it is assumed that:
(a1) Graph $\mathcal{G}$ is connected; i.e., there exists a (possibly) multi-hop path connecting any two agents.
With reference to the low-rank and sparse matrix recovery problem outlined in Section I, in the network setting envisioned here each agent acquires a few incomplete and noise-corrupted rows of matrix $\mathbf{Y}$. Specifically, the local data available to agent $n$ is the matrix $\mathcal{P}_{\Omega_n}(\mathbf{Y}_n)$, where $\mathbf{Y}_n \in \mathbb{R}^{L_n \times T}$, $\sum_{n \in \mathcal{N}} L_n = L$, and $\mathbf{Y} = [\mathbf{Y}_1', \ldots, \mathbf{Y}_N']'$. The index pairs in $\Omega_n$ are those in $\Omega$ for which the row index matches the rows of $\mathbf{Y}$ observed by agent $n$. Additionally, suppose that agent $n$ has available the local matrix $\mathbf{R}_n \in \mathbb{R}^{L_n \times F}$, containing a row subset of $\mathbf{R}$ associated with the observed rows in $\mathbf{Y}_n$, i.e., $\mathbf{R} = [\mathbf{R}_1', \ldots, \mathbf{R}_N']'$. Agents collaborate to form the wanted estimator (P1) in a distributed fashion, which can be equivalently rewritten as
The objective of this paper is to develop a distributed algorithm for sparsity-regularized rank minimization via (P1), based on in-network processing of the locally available data. The described setup naturally suggests three characteristics that the algorithm should exhibit: c1) each agent $n \in \mathcal{N}$ should obtain estimates of the low-rank and sparse components that coincide with the corresponding solution of the centralized estimator (P1), which uses the entire data $\{\mathcal{P}_{\Omega_n}(\mathbf{Y}_n), \mathbf{R}_n\}_{n \in \mathcal{N}}$; c2) processing per agent should be kept as simple as possible; and c3) the overhead for inter-agent communications should be affordable and confined to single-hop neighborhoods.
III. Distributed Algorithm for In-Network Operation
To reduce the computational complexity and memory storage requirements of the distributed algorithm sought, it is henceforth assumed that:

(a2) An upper bound $\rho$ on the rank of the matrix $\hat{\mathbf{X}}$ obtained via (P1) is available.
As argued next, the smaller the value of $\rho$, the more efficient the algorithm becomes. A small value of $\rho$ is well motivated in various applications. For example, the Internet traffic analysis of backbone networks in [LCD04] demonstrates that origin-to-destination flows have a very low intrinsic dimensionality, which renders the traffic matrix low rank; see also Section IV-A. In addition, recall that the rank of $\hat{\mathbf{X}}$ is controlled by the choice of $\lambda_*$ in (P1), and the rank of the solution can be made small enough for sufficiently large $\lambda_*$. Because $\mathrm{rank}(\hat{\mathbf{X}}) \leq \rho$, (P1)'s search space is effectively reduced and one can factorize the decision variable as $\mathbf{X} = \mathbf{P}\mathbf{Q}'$, where $\mathbf{P}$ and $\mathbf{Q}$ are $L \times \rho$ and $T \times \rho$ matrices, respectively. Adopting this reparametrization of $\mathbf{X}$ in (P1), one obtains the following equivalent optimization problem
which is nonconvex due to the bilinear terms $\mathbf{P}\mathbf{Q}'$. The number of variables is reduced from $(L+F)T$ in (P1) to $\rho(L+T) + FT$ in (P2). The savings can be significant when $\rho$ is in the order of a few dozen, and both $L$ and $T$ are large. The dominant term $FT$ in the variable count of (P2) is due to $\mathbf{A}$, which is sparse and can be efficiently handled even when both $F$ and $T$ are large. Problem (P2) is still not amenable to distributed implementation due to: (i) the non-separable nuclear norm present in the cost function; and (ii) the global variables $\mathbf{Q}$ and $\mathbf{A}$ coupling the per-agent summands.
III-A. A separable nuclear norm regularization
To address (i), consider the following neat characterization of the nuclear norm [RFP07, Recht_Parallel_2011]

(3) $\|\mathbf{X}\|_* = \min_{\{\mathbf{P},\mathbf{Q}\}} \; \tfrac{1}{2}\left(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2\right), \quad \text{s. to } \mathbf{X} = \mathbf{P}\mathbf{Q}'.$

For an arbitrary matrix $\mathbf{X}$ with SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}'$, the minimum in (3) is attained for $\mathbf{P} = \mathbf{U}\boldsymbol{\Sigma}^{1/2}$ and $\mathbf{Q} = \mathbf{V}\boldsymbol{\Sigma}^{1/2}$. The optimization (3) is over all possible bilinear factorizations of $\mathbf{X}$, so that the number of columns of $\mathbf{P}$ and $\mathbf{Q}$ is also a variable. Leveraging (3), the following reformulation of (P2) provides an important first step towards obtaining a distributed estimator:
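The characterization (3) is easy to verify numerically. The numpy sketch below (all names are illustrative) builds the minimizing factors from the SVD, checks that the Frobenius-norm surrogate then equals the nuclear norm, and confirms that a different bilinear factorization of the same matrix can only increase the surrogate.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 20))

# Nuclear norm: sum of singular values
nuc = np.linalg.svd(X, compute_uv=False).sum()

# Minimizing factorization in (3): P = U*Sigma^{1/2}, Q = V*Sigma^{1/2}
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U * np.sqrt(s)      # scales column i of U by sqrt(sigma_i)
Q = Vt.T * np.sqrt(s)
assert np.allclose(P @ Q.T, X)
surrogate = 0.5 * (np.linalg.norm(P, 'fro')**2 + np.linalg.norm(Q, 'fro')**2)
assert np.isclose(surrogate, nuc)   # (3) is tight at the SVD factors

# Any other bilinear factorization X = P2 Q2' can only do worse
M = rng.standard_normal((20, 20)) + 5.0 * np.eye(20)  # invertible mixing matrix
P2, Q2 = P @ M, Q @ np.linalg.inv(M).T
assert np.allclose(P2 @ Q2.T, X)
other = 0.5 * (np.linalg.norm(P2, 'fro')**2 + np.linalg.norm(Q2, 'fro')**2)
assert other >= nuc - 1e-8
```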
As asserted in the following lemma, adopting the separable Frobenius-norm regularization in (P3) comes with no loss of optimality, provided the upper bound $\rho$ is chosen large enough.
Lemma 1: Under (a2), (P3) is equivalent to (P1).
Proof:
Let $\{\hat{\mathbf{X}}, \hat{\mathbf{A}}\}$ denote a minimizer of (P1). Clearly, $\mathrm{rank}(\hat{\mathbf{X}}) \leq \rho$ implies that (P2) is equivalent to (P1). From (3) one can also infer that for every feasible solution of (P2), the cost in (P3) is no smaller than that of (P2). The gap between the globally minimum costs of (P2) and (P3) vanishes at $\mathbf{P} = \mathbf{U}\boldsymbol{\Sigma}^{1/2}$, $\mathbf{Q} = \mathbf{V}\boldsymbol{\Sigma}^{1/2}$, and $\mathbf{A} = \hat{\mathbf{A}}$, where $\hat{\mathbf{X}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}'$ is the SVD of $\hat{\mathbf{X}}$. Therefore, the cost functions of (P1) and (P3) are identical at the minimum.
Lemma 1 ensures that by finding the global minimum of (P3) [which could have significantly fewer variables than (P1)], one can recover the optimal solution of (P1). However, since (P3) is nonconvex, it may have stationary points which need not be globally optimum. Interestingly, the next proposition shows that under relatively mild assumptions on the data and the noise variance, every stationary point of (P3) is globally optimum for (P1). For a proof, see Appendix A.
Proposition 1: Let $\{\bar{\mathbf{P}}, \bar{\mathbf{Q}}, \bar{\mathbf{A}}\}$ be a stationary point of (P3). If $\|\mathcal{P}_\Omega(\mathbf{Y} - \bar{\mathbf{P}}\bar{\mathbf{Q}}' - \mathbf{R}\bar{\mathbf{A}})\| \leq \lambda_*$ (no subscript in $\|\cdot\|$ signifies spectral norm), then $\{\hat{\mathbf{X}} := \bar{\mathbf{P}}\bar{\mathbf{Q}}', \hat{\mathbf{A}} := \bar{\mathbf{A}}\}$ is the globally optimal solution of (P1).
The condition of Proposition 1 tacitly captures the role of $\rho$, the number of columns of $\mathbf{P}$ and $\mathbf{Q}$ in the postulated model. In particular, for sufficiently small $\rho$ the residual becomes large and consequently the condition is violated (unless $\lambda_*$ is large enough, in which case a sufficiently low-rank solution to (P1) is expected). This is manifested through the fact that the extra condition in Lemma 1 is no longer needed. In addition, note that the noise variance certainly affects the spectral norm of the residual, and thus the satisfaction of the said condition.
III-B. Local variables and consensus constraints
To decompose the cost function in (P3), in which summands are coupled through the global variables $\mathbf{Q}$ and $\mathbf{A}$ [cf. (ii) at the beginning of this section], introduce auxiliary variables $\{\mathbf{Q}_n, \mathbf{A}_n\}_{n \in \mathcal{N}}$ representing local estimates of $\{\mathbf{Q}, \mathbf{A}\}$ per agent $n$. These local estimates are utilized to form the separable constrained minimization problem
s. to  
For reasons that will become clear later on, additional splitting variables were introduced to separate the $\ell_2$-norm fitting-error part of the cost of (P4) from the $\ell_1$-norm regularization on the $\mathbf{A}_n$ (cf. Remark 3). These extra variables are not needed when $\mathbf{R}$ is the identity matrix. The set of additional constraints ensures that, in this sense, nothing changes in going from (P3) to (P4). Most importantly, (P3) and (P4) are equivalent optimization problems under (a1). The equivalence should be understood in the sense that $\mathbf{Q}_n^\star = \mathbf{Q}^\star$ for all $n \in \mathcal{N}$, and likewise for $\mathbf{A}_n^\star$, where $\{\mathbf{Q}_n^\star, \mathbf{A}_n^\star\}$ and $\{\mathbf{Q}^\star, \mathbf{A}^\star\}$ are the optimal solutions of (P4) and (P3), respectively. Of course, the corresponding estimates of $\mathbf{X}$ will coincide as well. Even though consensus is a fortiori imposed only within neighborhoods, it extends to the whole (connected) network, and local estimates agree on the global solution of (P3). To arrive at the desired distributed algorithm, it is convenient to reparametrize the consensus constraints in (P4) as
(4)  
(5) 
where the newly introduced quantities are auxiliary optimization variables that will eventually be eliminated.
III-C. The alternating-direction method of multipliers
To tackle the constrained minimization problem (P4), associate Lagrange multipliers with the splitting constraints. Likewise, associate additional dual variables with the first pair of consensus constraints in (4) [respectively (5)]. Next, introduce the quadratically augmented Lagrangian function
(6) 
where $c > 0$ is a penalty coefficient, and the primal variables are split into three groups. For notational convenience, collect all the multipliers in a single set. Note that the remaining constraints in (4) and (5) have not been dualized.
To minimize (P4) in a distributed fashion, a variation of the alternating-direction method of multipliers (ADMoM) will be adopted here. The ADMoM is an iterative augmented Lagrangian method especially well-suited for parallel processing [Bertsekas_Book_Distr], which has proven successful in tackling the optimization tasks encountered e.g., in distributed estimation problems [Yannis_Ale_GG_PartI, mateos_dlasso]. The proposed solver entails an iterative procedure comprising four steps per iteration:
 [S1]

Update dual variables:
(7) (8) (9) (10) (11)  [S2]

Update first group of primal variables:
(12)  [S3]

Update second group of primal variables:
(13)  [S4]

Update auxiliary primal variables:
(14)
This four-step procedure implements a block-coordinate descent method with dual variable updates. At each step, while minimizing the augmented Lagrangian, the variables not being updated are treated as fixed and substituted with their most up-to-date values. Different from the standard ADMoM, the alternating-minimization step here generally cycles over three groups of primal variables (cf. two groups in ADMoM [Bertsekas_Book_Distr]). In some special instances detailed in Sections IV-C and IV-D, cycling over two groups of variables only is sufficient. In [S1], the step size of the subgradient ascent iterations (7)-(11) need not coincide with the penalty coefficient. While it is common in ADMoM implementations to select the step size equal to $c$, a distinction between the step size and the penalty parameter is made explicit here in the interest of generality.
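The dual-ascent-plus-local-quadratic-program flavor of these steps can be illustrated on a much simpler problem: decentralized consensus averaging, where each agent holds a scalar and the network must agree on the mean. The sketch below is a standard consensus-ADMoM recursion with the edge (auxiliary) variables already eliminated, not the paper's Algorithm 1; the topology, data, and penalty value are illustrative.

```python
import numpy as np

# Toy network: path graph 0-1-2-3-4 (connected, per (a1))
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])   # local datum per agent
N = len(y)

c = 0.5                  # ADMoM penalty (step size set equal to c here)
x = np.zeros(N)          # local primal estimates
lam = np.zeros(N)        # aggregated local dual variables

for _ in range(1000):
    x_prev = x.copy()
    # [S1] dual ascent on the (eliminated) neighborhood consensus constraints
    for n in range(N):
        lam[n] += c * sum(x_prev[n] - x_prev[m] for m in neighbors[n])
    # [S2] primal update: each agent solves a small local quadratic program
    #      min_x 0.5*(x - y_n)^2 + lam_n*x + c*sum_m (x - (x_n+x_m)/2)^2
    for n in range(N):
        d = len(neighbors[n])
        s = sum(x_prev[n] + x_prev[m] for m in neighbors[n])
        x[n] = (y[n] - lam[n] + c * s) / (1.0 + 2.0 * c * d)

# Upon convergence, all agents agree on the network-wide average of y
```

Note that, as in Algorithm 1, only the primal iterates are exchanged between single-hop neighbors; the dual variables stay local.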
Reformulating the estimator (P1) into its equivalent form (P4) renders the augmented Lagrangian in (6) highly decomposable. The separability comes in two flavors: with respect to the three groups of primal variables, as well as across the network agents $n \in \mathcal{N}$. This in turn leads to highly parallelized, simplified recursions corresponding to the aforementioned four steps. Specifically, it is shown in Appendix B that if the multipliers are initialized to zero, [S1]-[S4] constitute the distributed algorithm tabulated under Algorithm 1. For conciseness in presenting the algorithm, define the local fitting residuals per agent. In addition, define the soft-thresholding matrix $\mathcal{S}_\lambda(\mathbf{M})$ with $(i,j)$-th entry given by $\mathrm{sign}(m_{i,j}) \max(|m_{i,j}| - \lambda, 0)$, where $m_{i,j}$ denotes the $(i,j)$-th entry of $\mathbf{M}$.
Remark 1 (Simplification of redundant variables)
Careful inspection of Algorithm 1 reveals that the inherently redundant auxiliary variables and their multipliers have been eliminated. Moreover, agent $n$ does not need to separately keep track of all its non-redundant multipliers, but only to update their respective (scaled) sums.
Remark 2 (Computational and communication cost)
The main computational burden of the algorithm stems from solving unconstrained quadratic programs locally to update the low-rank factors, and from carrying out simple soft-thresholding operations to update the local sparse matrices $\mathbf{A}_n$. On a per-iteration basis, network agents communicate their updated local estimates $\{\mathbf{Q}_n, \mathbf{A}_n\}$ to their neighbors, in order to carry out the updates of the primal and dual variables during the next iteration. Regarding communication cost, $\mathbf{Q}_n$ is a $T \times \rho$ matrix and its transmission does not incur significant overhead when $\rho$ is small. In addition, the matrix $\mathbf{A}_n$ is sparse, and can be communicated efficiently. Observe that the dual variables need not be exchanged.
Remark 3 (General sparsitypromoting regularization)
Even though the $\ell_1$ norm was adopted in (P1) to encourage sparsity in the entries of $\mathbf{A}$, the algorithmic framework here can accommodate more general structured sparsity-promoting penalties $\psi(\mathbf{A})$. To maintain the per-agent computational complexity at affordable levels, the minimum requirement on the admissible penalties is that the proximal operator

(15) $\mathrm{prox}_{\mu\psi}(\mathbf{M}) := \arg\min_{\mathbf{A}} \left\{ \mu \psi(\mathbf{A}) + \tfrac{1}{2} \|\mathbf{A} - \mathbf{M}\|_F^2 \right\}$

is given in terms of vector or (and) scalar soft-thresholding operators. In addition to the $\ell_1$ norm (Lasso penalty), this holds for the sum of row-wise $\ell_2$ norms (group Lasso penalty [yuan_group_lasso]), or a linear combination of the aforementioned two, the so-termed hierarchical Lasso penalty that encourages sparsity across and within the rows of $\mathbf{A}$ [hilasso]. All this is possible since, by introducing the cost-splitting variables, the local sparse matrix updates take precisely the form (15) for a suitable threshold $\mu$ and argument $\mathbf{M}$ (see Appendix B).
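For the two penalties named above, the proximal operator reduces to scalar soft-thresholding (Lasso) and row-wise vector soft-thresholding (group Lasso). A short numpy sketch, with an arbitrary illustrative threshold `mu`:

```python
import numpy as np

def prox_l1(M, mu):
    # prox of mu*||A||_1: entrywise (scalar) soft-thresholding
    return np.sign(M) * np.maximum(np.abs(M) - mu, 0.0)

def prox_group_rows(M, mu):
    # prox of mu*sum_i ||i-th row||_2: row-wise vector soft-thresholding;
    # a whole row is zeroed out when its l2 norm falls below mu
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(1.0 - mu / np.maximum(norms, 1e-12), 0.0)
    return scale * M

M = np.array([[3.0, -4.0],
              [0.1,  0.2],
              [-2.0, 0.0]])
A1 = prox_l1(M, 1.0)          # shrinks every entry toward zero by 1.0
A2 = prox_group_rows(M, 1.0)  # shrinks/zeros entire rows
```

In `A2`, the second row (norm ≈ 0.22 < 1) is set to zero as a block, whereas `A1` thresholds each entry independently; this is precisely the across-rows versus within-rows sparsity distinction mentioned above.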
When employed to solve nonconvex problems such as (P4), ADMoM (or the variant used here) offers no general convergence guarantees. However, there is ample experimental evidence in the literature supporting the convergence of ADMoM, especially when the nonconvex problem at hand exhibits "favorable" structure. For instance, (P4) is biconvex and gives rise to the strictly convex optimization subproblems (12)-(14), which admit unique closed-form solutions per iteration. This observation and the linearity of the constraints endow Algorithm 1 with good convergence properties; extensive numerical tests, including those presented in Section V, demonstrate that this is indeed the case. While a formal convergence proof goes beyond the scope of this paper, the following proposition proved in Appendix C asserts that, upon convergence, Algorithm 1 attains consensus and global optimality.
Proposition 2: If the sequence of iterates generated by Algorithm 1 converges, and (a1) holds, then: i) the local estimates attain consensus across the network; and ii) if the residual condition of Proposition 1 is satisfied at the limit point, then the limit point yields $\{\hat{\mathbf{X}}, \hat{\mathbf{A}}\}$, the global optimum of (P1).
IV. Applications
This section outlines a few applications that can benefit from the distributed sparsity-regularized rank minimization framework described so far. In each case, the problem statement calls for estimating low-rank $\mathbf{X}$ and (or) sparse $\mathbf{A}$, given distributed data adhering to an application-dependent model subsumed by (2). Customized algorithms are thus obtained as special cases of the general iterations in Algorithm 1.
IV-A. Unveiling traffic anomalies in backbone networks
In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt changes which can result in congestion, and limit the quality-of-service provisioning of the end users. These so-termed traffic volume anomalies could be due to external sources such as network failures, denial-of-service attacks, or intruders which hijack the network services [MC03], [LCD04]. Unveiling such anomalies is a crucial task towards engineering network traffic. This is a challenging task, however, since the available data are usually high-dimensional noisy link-load measurements, which comprise the superposition of unobservable OD flows, as explained next.
The network is modeled as in Section II, and transports a set of $F$ end-to-end flows (with $F \gg L$) associated with specific source-destination pairs. For backbone networks, the number of network-layer flows $F$ is typically much larger than the number of physical links $L$. Single-path routing is considered here to send the traffic of each flow from its source to its intended destination. Accordingly, for a particular flow, multiple links connecting the corresponding source-destination pair are chosen to carry the traffic. Sparing details that can be found in [MMG12], the traffic $\mathbf{Y} \in \mathbb{R}^{L \times T}$ carried over the $L$ links and measured at $T$ time instants can be compactly expressed as
(16) $\mathbf{Y} = \mathbf{R}(\mathbf{Z} + \mathbf{A}) + \mathbf{V}$

where the fat routing matrix $\mathbf{R} \in \{0,1\}^{L \times F}$ is fixed and given, $\mathbf{Z} \in \mathbb{R}^{F \times T}$ denotes the unknown "clean" traffic flows over the time horizon of interest, and $\mathbf{A} \in \mathbb{R}^{F \times T}$ collects the traffic volume anomalies. These data are distributed. Agent $n$ acquires a few rows $\mathbf{Y}_n$ of $\mathbf{Y}$ corresponding to the link-load traffic measurements from its outgoing links, and has available its local routing table $\mathbf{R}_n$, which indicates the OD flows routed through $n$. Assuming a suitable ordering of links, the per-agent quantities relate to their global counterparts in (16) through $\mathbf{Y} = [\mathbf{Y}_1', \ldots, \mathbf{Y}_N']'$ and $\mathbf{R} = [\mathbf{R}_1', \ldots, \mathbf{R}_N']'$.
Common temporal patterns among the traffic flows, in addition to their periodic behavior, render most rows (respectively columns) of $\mathbf{Z}$ linearly dependent, and thus $\mathbf{Z}$ typically has low rank [LCD04]. Anomalies are expected to occur sporadically over time, and to last only for short periods relative to the (possibly long) measurement interval. In addition, only a small fraction of the flows is supposed to be anomalous at any given time instant. This renders the anomaly matrix $\mathbf{A}$ sparse across rows and columns. Given the local measurements $\mathbf{Y}_n$ and the routing tables $\mathbf{R}_n$, the goal is to estimate $\mathbf{A}$ in a distributed fashion, by capitalizing on the sparsity of $\mathbf{A}$ and the low-rank property of $\mathbf{Z}$. Since the primary goal is to recover $\mathbf{A}$, define $\mathbf{X} := \mathbf{R}\mathbf{Z}$, which inherits the low-rank property from $\mathbf{Z}$, and consider [cf. (16)]

(17) $\mathbf{Y} = \mathbf{X} + \mathbf{R}\mathbf{A} + \mathbf{V}.$
Model (17) is a special case of (2) in which all the entries of $\mathbf{Y}$ are observed, i.e., $\Omega = \{1,\ldots,L\} \times \{1,\ldots,T\}$. Note that $\mathbf{R}\mathbf{A}$ is not sparse even though $\mathbf{A}$ is itself sparse, hence principal components pursuit is not applicable here [zlwcm10]. Instead, the following estimator is adopted to unveil network anomalies [MMG12]
which is subsumed by (P1). Accordingly, a distributed algorithm can be readily obtained by simplifying the general iterations under Algorithm 1, the subject dealt with next.
Distributed Algorithm for Unveiling Network Anomalies (DUNA). For the specific case here in which $\Omega$ comprises all entries of $\mathbf{Y}$, the residuals in Algorithm 1 simplify, since the sampling operator reduces to the identity. Accordingly, to update the primal variables as per Algorithm 1, one needs to solve respective unconstrained strictly convex quadratic optimization problems. These admit the closed-form solutions detailed under Algorithm 2. The DUNA updates of the local anomaly matrices $\mathbf{A}_n$ are given in terms of soft-thresholding operations, exactly as in Algorithm 1.
Conceivably, the number of flows $F$ can be quite large, thus inverting the $F \times F$ matrix needed to update $\mathbf{A}_n$ could be computationally demanding. Fortunately, the inversion needs to be carried out only once, and can be performed and cached offline. In addition, to reduce the inversion cost, the SVD of the local routing matrix $\mathbf{R}_n$ can be obtained first, and the matrix inversion lemma can be subsequently employed, so that only a (much smaller) $L_n \times L_n$ matrix needs to be inverted.
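The inversion-cost reduction can be sketched with the matrix inversion (Woodbury) lemma: inverting a matrix of the form $\mathbf{I}_F + c\,\mathbf{R}_n'\mathbf{R}_n$ only requires an $L_n \times L_n$ inverse when $L_n \ll F$. A numpy illustration under hypothetical sizes (the exact matrix inverted in Algorithm 2 may differ in its scaling):

```python
import numpy as np

rng = np.random.default_rng(2)
F, Ln, c = 200, 5, 0.7            # many flows, few local links (hypothetical)
R = rng.standard_normal((Ln, F))  # stand-in for a local routing matrix

# Direct inverse: O(F^3) work on an F x F matrix
direct = np.linalg.inv(np.eye(F) + c * R.T @ R)

# Matrix inversion lemma: (I_F + c R'R)^{-1} = I_F - c R'(I_Ln + c RR')^{-1} R,
# so only an Ln x Ln inverse is needed
small = np.linalg.inv(np.eye(Ln) + c * R @ R.T)
woodbury = np.eye(F) - c * R.T @ small @ R
```

Since $L_n$ is the number of an agent's outgoing links while $F$ counts all network-layer flows, this rearrangement turns a large offline inversion into a trivial one.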