Let $\mathbf{X} \in \mathbb{R}^{L \times T}$ be a low-rank matrix, i.e., $\mathrm{rank}(\mathbf{X}) \ll \min(L,T)$, and let $\mathbf{A} \in \mathbb{R}^{F \times T}$ be a sparse matrix with support size considerably smaller than $FT$. Consider also a matrix $\mathbf{R} \in \mathbb{R}^{L \times F}$ and a set $\Omega \subseteq \{1,\ldots,L\} \times \{1,\ldots,T\}$ of index pairs $(l,t)$ that define a sampling of the entries of $\mathbf{Y}$. Given $\mathbf{R}$ and a number of (possibly) noise-corrupted measurements
$$y_{l,t} = x_{l,t} + \mathbf{r}_l^{\top}\mathbf{a}_t + v_{l,t}, \quad (l,t) \in \Omega \tag{1}$$
the goal is to estimate the low-rank matrix $\mathbf{X}$ and the sparse matrix $\mathbf{A}$, by denoising the observed entries and imputing the missing ones. Introducing the sampling operator $\mathcal{P}_{\Omega}(\cdot)$, which sets the entries of its matrix argument not in $\Omega$ to zero and leaves the rest unchanged, the data model can be compactly written in matrix form as
$$\mathcal{P}_{\Omega}(\mathbf{Y}) = \mathcal{P}_{\Omega}(\mathbf{X} + \mathbf{R}\mathbf{A} + \mathbf{V}). \tag{2}$$
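For concreteness, the sampling operator and data model (2) can be sketched in numpy as follows; the dimensions and numeric values below are illustrative only, and the symbol names (`X`, `A`, `R`, `V`, `omega`) merely mirror the notation introduced above:

```python
import numpy as np

def sampling_op(M, omega):
    """P_Omega: zero out every entry of M whose index pair is not in omega."""
    out = np.zeros_like(M)
    rows, cols = zip(*omega)
    out[list(rows), list(cols)] = M[list(rows), list(cols)]
    return out

# A hypothetical small instance of the model P_Omega(Y) = P_Omega(X + R A + V).
rng = np.random.default_rng(0)
L, F, T, rho = 6, 8, 5, 2
X = rng.standard_normal((L, rho)) @ rng.standard_normal((rho, T))  # low rank
A = np.zeros((F, T)); A[1, 2] = 3.0; A[5, 0] = -2.0                # sparse
R = rng.standard_normal((L, F))
V = 0.01 * rng.standard_normal((L, T))                             # noise
omega = [(i, j) for i in range(L) for j in range(T) if rng.random() < 0.7]
Y = sampling_op(X + R @ A + V, omega)   # only the sampled entries survive
```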
A natural estimator accounting for the low rank of $\mathbf{X}$ and the sparsity of $\mathbf{A}$ will be sought to fit the data $\mathcal{P}_{\Omega}(\mathbf{Y})$ in the least-squares (LS) error sense, as well as minimize the rank of $\mathbf{X}$, and the number of nonzero entries of $\mathbf{A}$ measured by its $\ell_0$-(pseudo) norm; see e.g., [CP10], [MMG12], [CLMW09], [CSPW11] for related problems subsumed by the one described here. Unfortunately, both rank and $\ell_0$-norm minimization are in general NP-hard problems [Natarajan_NP_duro, rank_NP_Duro]. Typically, the nuclear norm $\|\mathbf{X}\|_* := \sum_i \sigma_i(\mathbf{X})$ (where $\sigma_i(\mathbf{X})$ denotes the $i$-th singular value of $\mathbf{X}$) and the $\ell_1$-norm $\|\mathbf{A}\|_1$ are adopted as surrogates, since they are the closest convex approximants to $\mathrm{rank}(\mathbf{X})$ and $\|\mathbf{A}\|_0$, respectively [RFP07, CT05, tropp06tit]. Accordingly, one solves
$$\text{(P1)} \quad \min_{\mathbf{X},\mathbf{A}} \ \tfrac{1}{2}\left\|\mathcal{P}_{\Omega}\left(\mathbf{Y} - \mathbf{X} - \mathbf{R}\mathbf{A}\right)\right\|_F^2 + \lambda_* \|\mathbf{X}\|_* + \lambda_1 \|\mathbf{A}\|_1$$
where $\lambda_*, \lambda_1 \geq 0$ are rank- and sparsity-controlling parameters. Being convex, (P1) is appealing, and some of its special instances are known to attain good performance in theory and practice. For instance, when no data are missing, (P1) can be used to unveil traffic anomalies in networks [MMG12]. Results in [MMG12] show that $\mathbf{X}$ and $\mathbf{A}$ can be exactly recovered in the absence of noise, even when $\mathbf{R}$ is a fat (compression) operator. When $\mathbf{R} = \mathbf{I}_L$, (P1) reduces to the robust principal component analysis formulations of [CLMW09] and [CSPW11]. Moreover, for the special case $\mathbf{A} = \mathbf{0}_{F \times T}$, (P1) offers a low-rank matrix completion alternative with well-documented merits; see e.g., [CR08] and [CP10]. Stable recovery results in the presence of noise are also available for matrix completion and robust PCA [CP10, zlwcm10]. Earlier efforts dealing with the recovery of sparse vectors in noise led to similar performance guarantees; see e.g., [bickel09].
In all these works, the samples $\mathcal{P}_{\Omega}(\mathbf{Y})$ and the matrix $\mathbf{R}$ are assumed centrally available, so that they can be jointly processed to estimate $\mathbf{X}$ and $\mathbf{A}$ by e.g., solving (P1). Collecting all this information can be challenging, though, in various applications of interest, or may even be impossible in e.g., wireless sensor networks (WSNs) operating under stringent power budget constraints. In other cases such as the Internet or collaborative marketing studies, agents providing private data for e.g., fitting a low-rank preference model, may not be willing to share their training data but only the learning results. Performing the optimization in a centralized fashion raises robustness concerns as well, since the central processor represents an isolated point of failure. Scalability is yet another driving force motivating distributed solutions. Several customized iterative algorithms have been proposed to solve instances of (P1), and have been shown effective in tackling low- to medium-size problems; see e.g., [MMG12], [CR08], [RFP07].
However, most algorithms require the computation of singular values per iteration, and become prohibitively expensive when dealing with high-dimensional data [Recht_Parallel_2011]. All in all, the aforementioned reasons motivate the reduced-complexity distributed algorithm for nuclear- and $\ell_1$-norm minimization developed in this paper.
In a similar vein, stochastic gradient algorithms were recently developed for large-scale problems entailing regularization with the nuclear norm [Recht_Parallel_2011]. Even though the iterations in [Recht_Parallel_2011] are highly parallelizable, they are not applicable to networks of arbitrary topology. There are also several studies on distributed estimation of sparse signals via $\ell_1$-norm regularized regression; see e.g., [mateos_dlasso, dusan11, chouvardas11]. Different from the treatment here, the data model of [mateos_dlasso] is devoid of a low-rank component, and all the observations are assumed available (but distributed across several interconnected agents). Formally, the model therein is a special case of (2) with $\mathbf{X} = \mathbf{0}$, $T = 1$, and all entries observed, in which case (P1) boils down to finding the least-absolute shrinkage and selection operator (Lasso) [tibshirani_lasso].
Building on the general model (2) and the centralized estimator (P1), this paper develops decentralized algorithms to estimate low-rank and sparse matrices, based on in-network processing of a small subset of noise-corrupted and spatially-distributed measurements (Section III). This is a challenging task, however, since the non-separable nuclear norm in (P1) is not amenable to distributed minimization. To overcome this limitation, an alternative characterization of the nuclear norm is adopted in Section III-A, which leads to a separable yet non-convex cost that is minimized via the alternating-direction method of multipliers (AD-MoM) [Bertsekas_Book_Distr]. The novel distributed iterations entail reduced-complexity optimization subtasks per agent, and affordable message passing only between single-hop neighbors (Section III-C). Interestingly, the distributed (non-convex) estimator upon convergence provably attains the global optimum of its centralized counterpart (P1), regardless of initialization. To demonstrate the generality of the proposed estimator and its algorithmic framework, four networking-related application domains are outlined in Section IV, namely: i) unveiling traffic volume anomalies for large-scale networks [LCD04, MMG12]; ii) robust PCA [CSPW11, CLMW09]; iii) low-rank matrix completion for network-wide path latency prediction [latencyprediction]; and iv) spectrum sensing for cognitive radio (CR) networks [mateos_dlasso, juan_sensig]. Numerical tests with synthetic and real network data drawn from these application domains corroborate the effectiveness and convergence of the novel distributed algorithms, as well as their centralized performance benchmarks (Section V).
Section VI concludes the paper, while several technical details are deferred to the Appendix.
Notation: Bold uppercase (lowercase) letters will denote matrices (column vectors), and calligraphic letters will be used for sets. Operators $(\cdot)^{\top}$, $\mathrm{tr}(\cdot)$, $\sigma_{\max}(\cdot)$, $\odot$, and $\otimes$ will denote transposition, matrix trace, maximum singular value, Hadamard product, and Kronecker product, respectively; $|\cdot|$ will be used for the cardinality of a set, and the magnitude of a scalar. The matrix vectorization operator $\mathrm{vec}(\mathbf{M})$ stacks the columns of matrix $\mathbf{M}$ on top of each other to return a supervector, and its inverse is $\mathrm{unvec}(\cdot)$. The diagonal matrix $\mathrm{diag}(\mathbf{x})$ has the entries of $\mathbf{x}$ on its diagonal, and a positive semidefinite matrix $\mathbf{M}$ will be denoted by $\mathbf{M} \succeq \mathbf{0}$. The $\ell_p$-norm of $\mathbf{x} \in \mathbb{R}^{n}$ is $\|\mathbf{x}\|_p := (\sum_{i=1}^{n} |x_i|^p)^{1/p}$ for $p \geq 1$. For two matrices $\mathbf{M}, \mathbf{N}$, $\langle \mathbf{M}, \mathbf{N} \rangle := \mathrm{tr}(\mathbf{M}^{\top}\mathbf{N})$ denotes their trace inner product. The Frobenius norm of matrix $\mathbf{M} = [m_{i,j}]$ is $\|\mathbf{M}\|_F := \sqrt{\mathrm{tr}(\mathbf{M}\mathbf{M}^{\top})}$, $\|\mathbf{M}\| := \sigma_{\max}(\mathbf{M})$ is the spectral norm, $\|\mathbf{M}\|_1 := \sum_{i,j} |m_{i,j}|$ is the $\ell_1$-norm, and $\|\mathbf{M}\|_* := \sum_i \sigma_i(\mathbf{M})$ is the nuclear norm, where $\sigma_i(\mathbf{M})$ denotes the $i$-th singular value of $\mathbf{M}$. The $n \times n$ identity matrix will be represented by $\mathbf{I}_n$, while $\mathbf{0}_n$ will stand for the $n \times 1$ vector of all zeros, and $\mathbf{0}_{n \times p} := \mathbf{0}_n \mathbf{0}_p^{\top}$. Similar notations will be adopted for vectors (matrices) of all ones.
II. Preliminaries and Problem Statement
Consider a network of $N$ agents capable of performing some local computations, as well as exchanging messages among directly connected neighbors. An agent should be understood as an abstract entity, e.g., a sensor in a WSN, a router monitoring Internet traffic, or a sensing CR from a next-generation communications technology. The network is modeled as an undirected graph $\mathcal{G}(\mathcal{N}, \mathcal{E})$, where the set of nodes $\mathcal{N} := \{1, \ldots, N\}$ corresponds to the network agents, and the edges (links) in $\mathcal{E}$ represent pairs of agents that can communicate. Agent $n \in \mathcal{N}$ communicates with its single-hop neighboring peers in $\mathcal{N}_n$, and the size of the neighborhood will be henceforth denoted by $|\mathcal{N}_n|$. To ensure that the data from an arbitrary agent can eventually percolate through the entire network, it is assumed that:
(a1) Graph $\mathcal{G}$ is connected; i.e., there exists a (possibly multi-hop) path connecting any two agents.
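Assumption (a1) is easy to verify computationally. The breadth-first search below is a minimal sketch; labeling agents $0, \ldots, N-1$ is an implementation convention, not the paper's notation:

```python
from collections import deque

def is_connected(num_nodes, edges):
    """Check (a1): every agent is reachable from agent 0 via multi-hop paths."""
    adj = {n: set() for n in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, queue = {0}, deque([0])
    while queue:                      # standard BFS over the undirected graph
        u = queue.popleft()
        for w in adj[u] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == num_nodes

# A 4-agent line network is connected; removing the middle edge disconnects it.
assert is_connected(4, [(0, 1), (1, 2), (2, 3)])
assert not is_connected(4, [(0, 1), (2, 3)])
```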
With reference to the low-rank and sparse matrix recovery problem outlined in Section I, in the network setting envisioned here each agent $n \in \mathcal{N}$ acquires a few incomplete and noise-corrupted rows of matrix $\mathbf{Y}$. Specifically, the local data available to agent $n$ is the matrix $\mathcal{P}_{\Omega_n}(\mathbf{Y}_n)$, where $\mathbf{Y}_n \in \mathbb{R}^{L_n \times T}$, $\sum_{n=1}^{N} L_n = L$, and $\Omega_n \subseteq \Omega$. The index pairs in $\Omega_n$ are those in $\Omega$ for which the row index matches the rows of $\mathbf{Y}$ observed by agent $n$. Additionally, suppose that agent $n$ has available the local matrix $\mathbf{R}_n \in \mathbb{R}^{L_n \times F}$, containing a row subset of $\mathbf{R}$ associated with the observed rows in $\mathbf{Y}_n$, i.e., $\mathbf{R} = [\mathbf{R}_1^{\top}, \ldots, \mathbf{R}_N^{\top}]^{\top}$. Agents collaborate to form the wanted estimator (P1) in a distributed fashion, which can be equivalently rewritten as
$$\min_{\mathbf{X},\mathbf{A}} \ \sum_{n=1}^{N} \tfrac{1}{2}\left\|\mathcal{P}_{\Omega_n}\left(\mathbf{Y}_n - \mathbf{X}_n - \mathbf{R}_n\mathbf{A}\right)\right\|_F^2 + \lambda_* \|\mathbf{X}\|_* + \lambda_1 \|\mathbf{A}\|_1$$
where $\mathbf{X}_n$ denotes the row block of $\mathbf{X}$ associated with agent $n$.
The objective of this paper is to develop a distributed algorithm for sparsity-regularized rank minimization via (P1), based on in-network processing of the locally available data. The described setup naturally suggests three characteristics that the algorithm should exhibit: c1) each agent $n \in \mathcal{N}$ should obtain estimates of $\mathbf{X}$ and $\mathbf{A}$ that coincide with the corresponding solution of the centralized estimator (P1), which uses the entire data $\{\mathcal{P}_{\Omega_n}(\mathbf{Y}_n), \mathbf{R}_n\}_{n \in \mathcal{N}}$; c2) processing per agent should be kept as simple as possible; and c3) the overhead for inter-agent communications should be affordable and confined to single-hop neighborhoods.
III. Distributed Algorithm for In-Network Operation
To reduce the computational complexity and memory storage requirements of the distributed algorithm sought, it is henceforth assumed that:
(a2) An upper bound $\rho$ on the rank of the matrix $\hat{\mathbf{X}}$ obtained via (P1) is available.
As argued next, the smaller the value of $\rho$, the more efficient the algorithm becomes. A small value of $\rho$ is well motivated in various applications. For example, the Internet traffic analysis of backbone networks in [LCD04] demonstrates that origin-to-destination flows have a very low intrinsic dimensionality, which renders the traffic matrix low rank; see also Section IV-A. In addition, recall that $\mathrm{rank}(\hat{\mathbf{X}})$ is controlled by the choice of $\lambda_*$ in (P1), and the rank of the solution can be made small enough for sufficiently large $\lambda_*$. Because $\mathrm{rank}(\hat{\mathbf{X}}) \leq \rho$, (P1)'s search space is effectively reduced and one can factorize the decision variable as $\mathbf{X} = \mathbf{P}\mathbf{Q}^{\top}$, where $\mathbf{P}$ and $\mathbf{Q}$ are $L \times \rho$ and $T \times \rho$ matrices, respectively. Adopting this reparametrization of $\mathbf{X}$ in (P1), one obtains the following equivalent optimization problem
which is non-convex due to the bilinear terms $\mathbf{P}\mathbf{Q}^{\top}$. The number of variables is reduced from $LT + FT$ in (P1) to $\rho(L + T) + FT$ in (P2). The savings can be significant when $\rho$ is in the order of a few dozens, and both $L$ and $T$ are large. The dominant $FT$-term in the variable count of (P2) is due to $\mathbf{A}$, which is sparse and can be efficiently handled even when both $F$ and $T$ are large. Problem (P2) is still not amenable to distributed implementation due to: (i) the non-separable nuclear norm present in the cost function; and (ii) the global variables $\mathbf{Q}$ and $\mathbf{A}$ coupling the per-agent summands.
III-A. A separable nuclear-norm regularization
To address (i), consider the following neat characterization of the nuclear norm [RFP07, Recht_Parallel_2011]
$$\|\mathbf{X}\|_* = \min_{\{\mathbf{P},\mathbf{Q}\}:\, \mathbf{X} = \mathbf{P}\mathbf{Q}^{\top}} \ \tfrac{1}{2}\left(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2\right). \tag{3}$$
For an arbitrary matrix $\mathbf{X}$ with SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$, the minimum in (3) is attained for $\mathbf{P} = \mathbf{U}\boldsymbol{\Sigma}^{1/2}$ and $\mathbf{Q} = \mathbf{V}\boldsymbol{\Sigma}^{1/2}$. The optimization (3) is over all possible bilinear factorizations of $\mathbf{X}$, so that the number of columns of $\mathbf{P}$ and $\mathbf{Q}$ is also a variable. Leveraging (3), the following reformulation of (P2) provides an important first step towards obtaining a distributed estimator:
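The attainment claim can be checked numerically. The sketch below builds $\mathbf{P} = \mathbf{U}\boldsymbol{\Sigma}^{1/2}$ and $\mathbf{Q} = \mathbf{V}\boldsymbol{\Sigma}^{1/2}$ from a random low-rank matrix and verifies that the Frobenius-norm cost in (3) matches the nuclear norm at this factorization (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))  # rank <= 3

U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U @ np.diag(np.sqrt(s))      # P = U Sigma^{1/2}
Q = Vt.T @ np.diag(np.sqrt(s))   # Q = V Sigma^{1/2}

nuc = s.sum()                                  # ||X||_* = sum of singular values
cost = 0.5 * (np.sum(P**2) + np.sum(Q**2))     # 0.5(||P||_F^2 + ||Q||_F^2)
assert np.allclose(P @ Q.T, X)                 # valid bilinear factorization
assert np.isclose(cost, nuc)                   # minimum of (3) is attained here
```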
As asserted in the following lemma, adopting the separable Frobenius-norm regularization in (P3) comes with no loss of optimality, provided the upper bound $\rho$ in (a2) is chosen large enough.
Lemma 1: Under (a2), (P3) is equivalent to (P1).
Proof: Let $\{\hat{\mathbf{X}}, \hat{\mathbf{A}}\}$ denote the minimizer of (P1). Clearly, $\rho \geq \mathrm{rank}(\hat{\mathbf{X}})$ implies that (P2) is equivalent to (P1). From (3) one can also infer that for every feasible solution of (P3), the cost in (P3) is no smaller than that of (P2). The gap between the globally minimum costs of (P2) and (P3) vanishes at $\mathbf{P} = \mathbf{U}\boldsymbol{\Sigma}^{1/2}$, $\mathbf{Q} = \mathbf{V}\boldsymbol{\Sigma}^{1/2}$, and $\mathbf{A} = \hat{\mathbf{A}}$, where $\hat{\mathbf{X}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$. Therefore, the cost functions of (P1) and (P3) are identical at the minimum.
Lemma 1 ensures that by finding the global minimum of (P3) [which could have significantly fewer variables than (P1)], one can recover the optimal solution of (P1). However, since (P3) is non-convex, it may have stationary points which need not be globally optimum. Interestingly, the next proposition shows that under relatively mild assumptions on the data and the noise variance, every stationary point of (P3) is globally optimum for (P1). For a proof, see Appendix A.
Proposition 1: Let $\{\bar{\mathbf{P}}, \bar{\mathbf{Q}}, \bar{\mathbf{A}}\}$ be a stationary point of (P3). If $\|\mathcal{P}_{\Omega}(\mathbf{Y} - \bar{\mathbf{P}}\bar{\mathbf{Q}}^{\top} - \mathbf{R}\bar{\mathbf{A}})\| \leq \lambda_*$ (no subscript in $\|\cdot\|$ signifies spectral norm), then $\{\hat{\mathbf{X}} := \bar{\mathbf{P}}\bar{\mathbf{Q}}^{\top}, \hat{\mathbf{A}} := \bar{\mathbf{A}}\}$ is the globally optimal solution of (P1).
The condition in Proposition 1 captures tacitly the role of $\rho$, the number of columns of $\mathbf{P}$ and $\mathbf{Q}$ in the postulated model. In particular, for sufficiently small $\rho$ the residual $\mathcal{P}_{\Omega}(\mathbf{Y} - \bar{\mathbf{P}}\bar{\mathbf{Q}}^{\top} - \mathbf{R}\bar{\mathbf{A}})$ becomes large, and consequently the condition is violated (unless $\lambda_*$ is large enough, in which case a sufficiently low-rank solution to (P1) is expected). This is manifested through the fact that the extra rank condition of Lemma 1 [cf. (a2)] is no longer needed. In addition, note that the noise variance certainly affects the spectral norm of the residual, and thus satisfaction of the said condition.
III-B. Local variables and consensus constraints
To decompose the cost function in (P3), whose summands are coupled through the global variables $\mathbf{Q}$ and $\mathbf{A}$ [cf. (ii) at the beginning of this section], introduce auxiliary variables $\{\mathbf{Q}_n, \mathbf{A}_n\}_{n \in \mathcal{N}}$ representing local estimates of $\{\mathbf{Q}, \mathbf{A}\}$ per agent $n$. These local estimates are utilized to form the separable constrained minimization problem
For reasons that will become clear later on, additional splitting variables were introduced to decouple the LS fitting-error part of the cost of (P4) from the $\ell_1$-norm regularization on the local estimates $\{\mathbf{A}_n\}$ (cf. Remark 3). These extra variables are not needed in certain special instances (cf. Sections IV-C and IV-D). The set of additional constraints ensures that, in this sense, nothing changes in going from (P3) to (P4). Most importantly, (P3) and (P4) are equivalent optimization problems under (a1). The equivalence should be understood in the sense that the optimal local estimates $\{\mathbf{Q}_n, \mathbf{A}_n\}$ of (P4) coincide, for every agent $n$, with the optimal $\{\mathbf{Q}, \mathbf{A}\}$ of (P3). Of course, the corresponding estimates of $\mathbf{X}$ will coincide as well. Even though consensus is a fortiori imposed only within neighborhoods, it extends to the whole (connected) network, and the local estimates agree on the global solution of (P3). To arrive at the desired distributed algorithm, it is convenient to reparametrize the consensus constraints in (P4) as
where the newly introduced quantities are auxiliary optimization variables that will eventually be eliminated.
III-C. The alternating-direction method of multipliers
To tackle the constrained minimization problem (P4), associate Lagrange multipliers with the splitting constraints, and additional dual variables with the first and second pairs of consensus constraints in (4) and (5), respectively. Next, introduce the quadratically augmented Lagrangian function
where $c > 0$ is a penalty coefficient, and the primal variables are split into three groups. For notational convenience, collect all multipliers into a single set of dual variables. Note that the remaining constraints in (4) and (5) have not been dualized.
To minimize (P4) in a distributed fashion, a variation of the alternating-direction method of multipliers (AD-MoM) will be adopted here. The AD-MoM is an iterative augmented Lagrangian method especially well-suited for parallel processing [Bertsekas_Book_Distr], which has proven successful in tackling the optimization tasks encountered in e.g., distributed estimation problems [Yannis_Ale_GG_PartI, mateos_dlasso]. The proposed solver entails an iterative procedure comprising four steps per iteration:
[S1] Update the dual variables via the subgradient ascent iterations (7)-(11).
[S2] Update the first group of primal variables.
[S3] Update the second group of primal variables.
[S4] Update the auxiliary primal variables.
This four-step procedure implements a block-coordinate descent method with dual variable updates. At each step, while minimizing the augmented Lagrangian, the variables not being updated are treated as fixed and are substituted with their most up-to-date values. Different from the standard AD-MoM, the alternating-minimization step here generally cycles over three groups of primal variables (cf. two groups in AD-MoM [Bertsekas_Book_Distr]). In some special instances detailed in Sections IV-C and IV-D, cycling over two groups of variables only is sufficient. In [S1], a step size parameter scales the subgradient ascent iterations (7)-(11). While it is common in AD-MoM implementations to select the step size equal to the penalty coefficient, a distinction between the step size and the penalty parameter is made explicit here in the interest of generality.
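The four-step cycle can be illustrated on a toy problem. The numpy sketch below is a minimal AD-MoM instance under simplifying assumptions (a single agent, a scalar splitting constraint $x = z$, and step size equal to the penalty $c$); it mirrors the pattern of quadratic primal updates, soft-thresholding updates, and dual ascent, but is not the paper's Algorithm 1:

```python
import numpy as np

def soft(v, tau):
    """Entrywise soft-thresholding: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_denoise(y, lam=0.5, c=1.0, iters=200):
    """AD-MoM for min_x 0.5||y - x||^2 + lam||z||_1  s.t.  x = z."""
    x = z = u = np.zeros_like(y)
    for _ in range(iters):
        x = (y + c * (z - u)) / (1.0 + c)   # first primal group (quadratic)
        z = soft(x + u, lam / c)            # second primal group (thresholding)
        u = u + (x - z)                     # dual ascent on the scaled multiplier
    return z

y = np.array([2.0, -0.3, 1.0])
# The closed-form minimizer of this toy problem is soft(y, lam).
assert np.allclose(admm_denoise(y), soft(y, 0.5), atol=1e-6)
```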
Reformulating the estimator (P1) into its equivalent form (P4) renders the augmented Lagrangian highly decomposable. The separability comes in two flavors: with respect to the three primal variable groups, as well as across the network agents $n \in \mathcal{N}$. This in turn leads to highly parallelized, simplified recursions corresponding to the aforementioned four steps. Specifically, it is shown in Appendix B that if the multipliers are initialized to zero, [S1]-[S4] constitute the distributed algorithm tabulated under Algorithm 1. For conciseness in presenting the algorithm, the local residuals are defined as the mismatch between the local data and the local model fit. In addition, define the soft-thresholding operator $\mathcal{S}(\mathbf{M}, \tau)$ with $(i,j)$-th entry given by $\mathrm{sign}(m_{i,j}) \max(|m_{i,j}| - \tau, 0)$, where $m_{i,j}$ denotes the $(i,j)$-th entry of $\mathbf{M}$.
Remark 1 (Simplification of redundant variables)
Careful inspection of Algorithm 1 reveals that the inherently redundant auxiliary variables and their multipliers have been eliminated. Moreover, agent $n$ does not need to separately keep track of all its non-redundant multipliers, but only to update their respective (scaled) sums.
Remark 2 (Computational and communication cost)
The main computational burden of the algorithm stems from solving unconstrained strictly convex quadratic programs locally to update the low-rank factors, and carrying out simple soft-thresholding operations to update the local sparse matrices. On a per-iteration basis, network agents communicate their updated local estimates to their neighbors, in order to carry out the updates of the primal and dual variables during the next iteration. Regarding communication cost, each local estimate of $\mathbf{Q}$ is a $T \times \rho$ matrix, and its transmission does not incur significant overhead when $\rho$ is small. In addition, each local estimate of $\mathbf{A}$ is sparse, and can be communicated efficiently. Observe that the dual variables need not be exchanged.
Remark 3 (General sparsity-promoting regularization)
Even though the $\ell_1$-norm was adopted in (P1) to encourage sparsity in the entries of $\mathbf{A}$, the algorithmic framework here can accommodate more general structured sparsity-promoting penalties. To maintain the per-agent computational complexity at affordable levels, the minimum requirement on an admissible penalty $\phi(\mathbf{A})$ is that its proximal operator
$$\mathrm{prox}_{\tau\phi}(\mathbf{M}) := \arg\min_{\mathbf{A}} \ \tfrac{1}{2}\|\mathbf{M} - \mathbf{A}\|_F^2 + \tau\,\phi(\mathbf{A})$$
is given in terms of vector or (and) scalar soft-thresholding operators. In addition to the $\ell_1$-norm (Lasso penalty), this holds for the sum of row-wise $\ell_2$-norms (group Lasso penalty [yuan_group_lasso]), or a linear combination of the aforementioned two, the so-termed hierarchical Lasso penalty that encourages sparsity across and within the rows of $\mathbf{A}$ [hilasso]. All this is possible because, by introducing the cost-splitting variables, the local sparse matrix updates take the form of a proximal operator evaluation for a suitable argument (see Appendix B).
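As an illustration of admissible penalties, the following numpy sketch (function names are illustrative, not the paper's) evaluates the proximal operators of the Lasso and group-Lasso penalties via scalar and vector soft-thresholding, respectively:

```python
import numpy as np

def prox_l1(M, tau):
    """Entrywise soft-thresholding: prox of tau*||.||_1 (Lasso penalty)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def prox_group_rows(M, tau):
    """Row-wise vector soft-thresholding: prox of tau * (sum of row l2-norms),
    i.e., the group-Lasso penalty, which shrinks whole rows toward zero."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * M

M = np.array([[3.0, -4.0], [0.1, 0.2]])
assert np.allclose(prox_l1(M, 1.0), [[2.0, -3.0], [0.0, 0.0]])
# Row norms are 5 and ~0.22: the first row shrinks by 1/5, the second is zeroed.
G = prox_group_rows(M, 1.0)
assert np.allclose(G[0], [2.4, -3.2]) and np.allclose(G[1], [0.0, 0.0])
```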
When employed to solve non-convex problems such as (P4), AD-MoM (or the variant used here) offers no convergence guarantees. However, there is ample experimental evidence in the literature supporting the convergence of AD-MoM, especially when the non-convex problem at hand exhibits "favorable" structure. For instance, (P4) is bi-convex and gives rise to the strictly convex optimization subproblems (12)-(14), which admit unique closed-form solutions per iteration. This observation and the linearity of the constraints endow Algorithm 1 with good convergence properties: extensive numerical tests, including those presented in Section V, demonstrate that this is indeed the case. While a formal convergence proof goes beyond the scope of this paper, the following proposition, proved in Appendix C, asserts that upon convergence, Algorithm 1 attains consensus and global optimality.
Proposition 2: If the sequence of iterates generated by Algorithm 1 converges, and (a1) holds, then: i) the local estimates reach consensus across the network; and ii) if the limit point satisfies the condition of Proposition 1, then the corresponding estimates $\{\hat{\mathbf{X}}, \hat{\mathbf{A}}\}$ coincide with the global optimum of (P1).
IV. Applications

This section outlines a few applications that could benefit from the distributed sparsity-regularized rank minimization framework described so far. In each case, the problem statement calls for estimating a low-rank $\mathbf{X}$ and (or) a sparse $\mathbf{A}$, given distributed data adhering to an application-dependent model subsumed by (2). Customized algorithms are thus obtained as special cases of the general iterations in Algorithm 1.
IV-A. Unveiling traffic anomalies in backbone networks
In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt changes which can result in congestion and limit the quality of service provisioning for end users. These so-termed traffic volume anomalies could be due to external sources such as network failures, denial-of-service attacks, or intruders which hijack the network services [MC03], [LCD04]. Unveiling such anomalies is a crucial task towards engineering network traffic. This is challenging, however, since the available data are usually high-dimensional noisy link-load measurements, which comprise the superposition of unobservable OD flows, as explained next.
The network is modeled as in Section II, and transports a set $\mathcal{F}$ of end-to-end flows (with $|\mathcal{F}| = F$) associated with specific source-destination pairs. For backbone networks, the number of network-layer flows $F$ is typically much larger than the number of physical links $L$. Single-path routing is considered here to send the traffic flow from a source to its intended destination. Accordingly, for a particular flow, multiple links connecting the corresponding source-destination pair are chosen to carry the traffic. Sparing details that can be found in [MMG12], the traffic carried over the $L$ links and measured at $T$ time instants can be compactly expressed as
$$\mathbf{Y} = \mathbf{R}(\mathbf{Z} + \mathbf{A}) + \mathbf{V} \tag{16}$$
where the fat routing matrix $\mathbf{R} \in \{0,1\}^{L \times F}$ is fixed and given, $\mathbf{Z} \in \mathbb{R}^{F \times T}$ denotes the unknown "clean" traffic flows over the time horizon of interest, and $\mathbf{A} \in \mathbb{R}^{F \times T}$ collects the traffic volume anomalies. These data are distributed. Agent $n$ acquires a few rows of $\mathbf{Y}$ corresponding to the link-load traffic measurements from its outgoing links, and has available its local routing table $\mathbf{R}_n$, which indicates the OD flows routed through those links. Assuming a suitable ordering of links, the per-agent quantities relate to their global counterparts in (16) through $\mathbf{Y} = [\mathbf{Y}_1^{\top}, \ldots, \mathbf{Y}_N^{\top}]^{\top}$ and $\mathbf{R} = [\mathbf{R}_1^{\top}, \ldots, \mathbf{R}_N^{\top}]^{\top}$.
Common temporal patterns among the traffic flows, in addition to their periodic behavior, render most rows (respectively columns) of $\mathbf{Z}$ linearly dependent, and thus $\mathbf{Z}$ typically has low rank [LCD04]. Anomalies are expected to occur sporadically over time, and to last only for short periods relative to the (possibly long) measurement interval $T$. In addition, only a small fraction of the flows are supposed to be anomalous at any given time instant. This renders the anomaly matrix $\mathbf{A}$ sparse across rows and columns. Given the local measurements and routing tables, the goal is to estimate $\{\mathbf{Z}, \mathbf{A}\}$ in a distributed fashion, by capitalizing on the sparsity of $\mathbf{A}$ and the low-rank property of $\mathbf{Z}$. Since the primary goal is to recover $\mathbf{A}$, define $\mathbf{X} := \mathbf{R}\mathbf{Z}$, which inherits the low-rank property from $\mathbf{Z}$, and consider [cf. (16)]
$$\mathbf{Y} = \mathbf{X} + \mathbf{R}\mathbf{A} + \mathbf{V}. \tag{17}$$
Model (17) is a special case of (2) in which all the entries of $\mathbf{Y}$ are observed, i.e., $\Omega = \{1,\ldots,L\} \times \{1,\ldots,T\}$. Note that $\mathbf{R}\mathbf{A}$ is not sparse even though $\mathbf{A}$ itself is sparse; hence, principal components pursuit is not applicable here [zlwcm10]. Instead, the following estimator is adopted to unveil network anomalies [MMG12]
$$\min_{\mathbf{X},\mathbf{A}} \ \tfrac{1}{2}\|\mathbf{Y} - \mathbf{X} - \mathbf{R}\mathbf{A}\|_F^2 + \lambda_* \|\mathbf{X}\|_* + \lambda_1 \|\mathbf{A}\|_1$$
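A small synthetic instance of the traffic model, with hypothetical dimensions and a single injected anomaly, illustrates why $\mathbf{X} := \mathbf{R}\mathbf{Z}$ inherits the low rank of $\mathbf{Z}$:

```python
import numpy as np

rng = np.random.default_rng(2)
Lnum, F, T, r = 4, 10, 20, 2     # links, flows, time slots, traffic rank
Routing = (rng.random((Lnum, F)) < 0.3).astype(float)  # fat 0/1 routing matrix
Z = rng.standard_normal((F, r)) @ rng.standard_normal((r, T))  # low-rank flows
A = np.zeros((F, T)); A[3, 7] = 10.0   # one sporadic anomaly on flow 3
Y = Routing @ (Z + A) + 0.01 * rng.standard_normal((Lnum, T))  # cf. (16)

X = Routing @ Z   # X := R Z inherits the low rank of Z [cf. (17)]
assert np.linalg.matrix_rank(X) <= r   # rank(RZ) <= rank(Z)
```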
which is subsumed by (P1). Accordingly, a distributed algorithm can be readily obtained by simplifying the general iterations under Algorithm 1, the subject dealt with next.
Distributed Algorithm for Unveiling Network Anomalies (DUNA). For the specific case here, in which all entries of $\mathbf{Y}$ are observed, the local residuals in Algorithm 1 simplify accordingly. To update the primal variables as per Algorithm 1, one needs to solve respective unconstrained strictly convex quadratic optimization problems; these admit the closed-form solutions detailed under Algorithm 2. The DUNA updates of the local anomaly matrices are given in terms of soft-thresholding operations, exactly as in Algorithm 1.
Conceivably, the number of flows $F$ can be quite large; thus, inverting an $F \times F$ matrix to update the local anomaly estimates could be computationally expensive. Fortunately, the inversion needs to be carried out only once, and can be performed and cached off-line. In addition, to reduce the inversion cost, the SVD of the local routing matrices can be obtained first, and the matrix inversion lemma can be subsequently employed so that only a much smaller matrix, of size given by the number of local links, needs to be inverted.
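The inversion shortcut alluded to above can be sketched with the matrix inversion lemma (Woodbury identity): instead of inverting an $F \times F$ matrix, one inverts only a matrix whose size equals the number of local links. The penalty value and dimensions below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
m, F = 3, 50                      # few local links, many flows (F >> m)
Rn = rng.standard_normal((m, F))  # stand-in for a local routing matrix R_n
c = 0.5                           # hypothetical penalty coefficient

# Direct F x F inversion of (c*I_F + R_n^T R_n) ...
direct = np.linalg.inv(c * np.eye(F) + Rn.T @ Rn)

# ... versus the matrix inversion lemma, which inverts only an m x m matrix:
# (c I_F + R^T R)^{-1} = (1/c) [ I_F - R^T (c I_m + R R^T)^{-1} R ]
small = np.linalg.inv(c * np.eye(m) + Rn @ Rn.T)
woodbury = (np.eye(F) - Rn.T @ small @ Rn) / c
assert np.allclose(direct, woodbury)
```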