 # Estimate Exchange over Network is Good for Distributed Hard Thresholding Pursuit

We investigate an existing distributed algorithm for learning sparse signals or data over networks. The algorithm is iterative and exchanges intermediate estimates of a sparse signal over a network. This learning strategy using exchange of intermediate estimates over the network requires a limited communication overhead for information transmission. Our objective in this article is to show that the strategy is good for learning in spite of limited communication. In pursuit of this objective, we first provide a restricted isometry property (RIP)-based theoretical analysis on convergence of the iterative algorithm. Then, using simulations, we show that the algorithm provides competitive performance in learning sparse signals vis-a-vis an existing alternate distributed algorithm. The alternate distributed algorithm exchanges more information including observations and system parameters.


## I Introduction

The estimation and learning of sparse signals has many applications, such as machine learning [1, 2], multimedia processing, compressive sensing, and wireless communications, to mention a few. Here we consider a distributed sparse learning problem. Consider a network of $L$ nodes with a network matrix $H$ describing the connections among the nodes. The $(l, r)$'th element $h_{lr}$ of $H$ specifies the weight of the link from node $r$ to node $l$. A zero-valued $h_{lr}$ signifies the absence of a direct link from node $r$ to node $l$. In the literature, $H$ is also known as a network policy matrix. Let $x$ be the underlying sparse signal to estimate or learn. Further assume that node $l$ has the observation

$$y_l = A_l x + e_l. \tag{1}$$

Here $e_l$ is the error term and $A_l$ is the system matrix (a sensing matrix or dictionary, depending on the application) at node $l$. For distributed learning, the nodes of a network exchange various information, for example, intermediate estimates of $x$, observations $y_l$, system matrices $A_l$, or their parameters. Using such information, a distributed algorithm learns $x$ at each node over iterations. A few important aspects of a distributed algorithm are scalability with the number of nodes, low computational requirements at each node, and limited communication between nodes. These aspects are required to realize a large distributed system with many nodes and high-dimensional system matrices. To keep computation low, we focus on greedy algorithms. Standard greedy algorithms, such as orthogonal matching pursuit, subspace pursuit, CoSaMP, and their variants [9, 10], have a low computational load and execute quickly. Further, to limit communication, we prefer that intermediate estimates of $x$ are exchanged over the network. Therefore we focus on developing distributed greedy sparse learning algorithms that exchange intermediate estimates of $x$ or relevant parameters over the network. Relevant past works in this direction are [11, 12, 13], where we proposed some rules for information exchange over a network and developed new greedy algorithms for various signal models.

In this article, we investigate an existing distributed greedy algorithm from the literature. Our interest is to analyze the performance of the algorithm. We do not propose any new rule or new algorithm. The existing algorithm exchanges intermediate estimates of $x$ between the nodes of the network. We refer to this algorithm as the distributed hard thresholding pursuit (DHTP). DHTP did not receive due attention in the work that introduced it: only minor simulation results were reported and no theoretical analysis was performed. For DHTP, our main objective is to show that estimate exchange is a good strategy for achieving good learning performance. In pursuit of this objective, we provide theoretical analysis and extensive simulation results. Using simulations, we show that exchanging observations $y_l$ and system matrices $A_l$ brings no significant performance gain. Exchanging intermediate estimates over the network is good, and this helps in communication- and computation-constrained scenarios.

### I-A Literature survey

We provide a literature survey for the problem of sparse learning over a network. For this problem, learning by consensus is a popular strategy. A consensus strategy achieves estimates that are the same at all network nodes after convergence. That means $\hat{x}_l = \hat{x}_r$ for all nodes $l$ and $r$, where $\hat{x}_l$ denotes the signal estimate at node $l$ at convergence of a distributed sparse learning algorithm. Using a greedy learning approach, a consensus-seeking algorithm was recently proposed in the literature. This algorithm is referred to as distributed hard thresholding (DiHaT). To achieve consensus, DiHaT exchanges intermediate estimates of $x$, observations $y_l$ and system matrices $A_l$. Furthermore, DiHaT uses a consensus-seeking network matrix with special properties; DiHaT requires that $H$ be a doubly stochastic matrix. Here consensus seeking means that all nodes have the same estimate at convergence. Alternatively, there exist several greedy sparse learning algorithms that do not seek consensus. DHTP is one example. Other examples include algorithms from [15, 16, 12, 17, 18].

For distributed sparse learning, there exist several convex optimization based algorithms, mainly in the application area of distributed compressed sensing [19, 20]. Some of the algorithms provide a centralized solution using a distributed convex optimization algorithm called the alternating direction method of multipliers (ADMM). ADMM-based distributed learning algorithms were proposed in [22, 23]. Specifically, one such ADMM-based method is called D-LASSO. It is worth mentioning that D-LASSO has been shown to converge more slowly than the greedy DiHaT. Using adaptive signal processing techniques such as gradient search, distributed sparse learning and sparse regression were realized in [24, 25, 26]. These adaptive algorithms typically use a mean-square-error cost averaged over all the nodes in a network to find an optimal solution via gradient search. Distributed learning and regression are then performed via diffusion of information over a network and adaptation in all individual nodes. Using a Bayesian framework for finding the posterior with sparsity-promoting priors, a distributed message-passing based method and a sparse Bayesian learning based method have also been proposed. Further, to promote sparsity in solutions, distributed system learning such as distributed dictionary learning was also considered in [29, 30]. Next, we mention that there exist several signal models in the literature where the sparse signals are not the same for all nodes of a network. For example, denoting the signal at node $l$ by $x_l$, the supports of the $x_l$ are the same in [17, 31, 32, 33], but not their signal values; further, the $x_l$ have common and private support and/or signal parts in [34, 12, 18]. In this article, we consider the setup (1) where $x_l = x$ for all $l$.

### I-B Contributions

Our objective is to show that signal estimate exchange is good for distributed sparse learning. There is no need to exchange $y_l$ and $A_l$. In pursuit of this objective, we investigate DHTP and our contributions are as follows.

1. We provide a restricted-isometry-property (RIP) based theoretical analysis and convergence guarantee for DHTP. For the error-free condition, that is, $e_l = 0$ for all $l$, the learned estimates at all nodes converge to the true signal under some technical conditions.

2. Using simulations, we show instances where DHTP provides better learning performance than DiHaT. For a fair comparison, we evaluate practical performance using a doubly stochastic network matrix.

3. We show that DHTP performs well for a general network matrix, not necessarily a doubly stochastic matrix.

### I-C Notation

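The support set of $x$ is defined as $\mathcal{T} = \{i : x_i \neq 0\}$. We use $|\mathcal{T}|$ and $\mathcal{T}^c$ to denote the cardinality and complement of the set $\mathcal{T}$, respectively. For a matrix $A$, a sub-matrix $A_{\mathcal{T}}$ consists of the columns of $A$ indexed by $\mathcal{T}$. Similarly, for a vector $x$, a sub-vector $x_{\mathcal{T}}$ is composed of the components of $x$ indexed by $\mathcal{T}$. Also we denote $(\cdot)^\top$ and $(\cdot)^\dagger$ as transpose and pseudo-inverse, respectively. We define the function $\operatorname{supp}(x, k)$ as the set of indices corresponding to the $k$ largest amplitude components of $x$. We use $\hat{x}_{l,k}$ to denote the signal estimate at node $l$ and iteration $k$.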
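As a small illustration of this notation, the NumPy sketch below computes a support set, its complement, a column sub-matrix, a sub-vector and a pseudo-inverse. The helper name `supp` and the toy values of `x` and `A` are assumptions chosen only for this example.

```python
import numpy as np

def supp(x, k):
    """Indices of the k largest-amplitude components of x, in increasing order."""
    return np.sort(np.argsort(np.abs(x))[-k:])

# Toy data (hypothetical values, for illustration only).
x = np.array([0.0, -3.0, 0.5, 0.0, 2.0])
A = np.arange(15, dtype=float).reshape(3, 5)

support = np.flatnonzero(x)                  # support set of x: indices of non-zero entries
T = supp(x, 2)                               # indices of the 2 largest-amplitude entries
T_c = np.setdiff1d(np.arange(x.size), T)     # complement of T
A_T = A[:, T]                                # sub-matrix: columns of A indexed by T
x_T = x[T]                                   # sub-vector: components of x indexed by T
A_T_pinv = np.linalg.pinv(A_T)               # pseudo-inverse of the sub-matrix
```

Here `T` contains the positions of the entries `-3.0` and `2.0`, so `x_T` keeps exactly those two values.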

## II DHTP Algorithm and Theoretical Analysis

The pseudo-code of the DHTP algorithm is shown in Algorithm 1. In every iteration $k$, the nodes use the standard algorithmic steps of the Hard Thresholding Pursuit (HTP) algorithm along with an extra step that includes information about the estimates at the neighbors to refine the local estimate (see Step 3 of Algorithm 1). Here, we denote the neighborhood of node $l$ by $\mathcal{N}_l$, i.e., $\mathcal{N}_l = \{r : h_{lr} \neq 0\}$. For theoretical analysis, we use the standard definition of the RIP of a matrix. We denote the $s$-Restricted Isometry Constant (RIC) of a matrix by $\delta_s$. We use and to denote the standard and norm of a vector, respectively. Throughout the paper, unless otherwise specified, we make the following assumptions.

###### Assumption 1

The network matrix is a right stochastic matrix. This assumption is quite general as any non-negative matrix can be reduced to a right stochastic matrix by appropriately scaling the rows of the matrix.
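The row scaling mentioned in Assumption 1 can be sketched as follows. The function name `right_stochastic` is hypothetical, and the sketch additionally assumes that every row has at least one non-zero weight, since an all-zero row cannot be scaled to sum to one.

```python
import numpy as np

def right_stochastic(W):
    """Scale each row of a non-negative matrix so that it sums to one."""
    W = np.asarray(W, dtype=float)
    if np.any(W < 0):
        raise ValueError("weights must be non-negative")
    row_sums = W.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every row needs at least one non-zero weight")
    return W / row_sums

# Example: a 3-node network with unnormalized link weights.
H = right_stochastic([[1, 1, 0],
                      [0, 2, 2],
                      [1, 0, 3]])
```

The rows of `H` sum to one, while its columns in general do not, which is the difference between a right stochastic and a doubly stochastic matrix.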

###### Assumption 2

The sparsity level of the signal $x$, denoted by $s$, is known a priori. This assumption is used in greedy algorithms such as CoSaMP, subspace pursuit and HTP.
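Algorithm 1 is not reproduced in this text, so the following is only a sketch of DHTP as it can be inferred from the steps referenced in the proofs of Section V: HTP support detection and projection at each node, a weighted combination of the neighbors' estimates, and pruning to $s$ entries. The function names `supp` and `dhtp`, the zero initialization and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def supp(v, s):
    """Indices of the s largest-amplitude components of v."""
    return np.sort(np.argsort(np.abs(v))[-s:])

def dhtp(A_list, y_list, H, s, num_iter=30):
    """Sketch of DHTP. A_list[l], y_list[l] belong to node l; H is right stochastic."""
    H = np.asarray(H, dtype=float)
    L, N = len(A_list), A_list[0].shape[1]
    x_hat = np.zeros((L, N))                      # assumed initialization
    for _ in range(num_iter):
        x_tilde = np.zeros((L, N))
        for l, (A, y) in enumerate(zip(A_list, y_list)):
            # HTP step: support from a gradient step, then least squares on it.
            T = supp(x_hat[l] + A.T @ (y - A @ x_hat[l]), s)
            x_tilde[l, T] = np.linalg.lstsq(A[:, T], y, rcond=None)[0]
        # Estimate exchange: row l is the weighted sum of the neighbors' estimates.
        x_check = H @ x_tilde
        # Pruning: keep the s largest-amplitude entries at every node.
        x_hat = np.zeros((L, N))
        for l in range(L):
            T = supp(x_check[l], s)
            x_hat[l, T] = x_check[l, T]
    return x_hat
```

Only the rows of `x_tilde` cross the network in this sketch; the observations `y_list` and the matrices `A_list` stay local to their nodes, and the `A_list[l]` may have different numbers of rows.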

We first provide a recurrence inequality for DHTP, which bounds the performance of the algorithm over iterations. For notational clarity, we define the RIC

$$\delta_{as} \triangleq \max_l \{\delta_{as}(A_l)\},$$

where $\delta_{as}(A_l)$ is the RIC of $A_l$ and $a$ is a positive integer such as 1, 2 or 3.

###### Theorem 1 (Recurrence inequality)

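The performance of the DHTP algorithm at iteration $k$ can be bounded as

$$\sum_{l=1}^{L} \|x - \hat{x}_{l,k}\| \le c_1 \sum_{l=1}^{L} w_l \|x - \hat{x}_{l,k-1}\| + d_1 \sum_{l=1}^{L} w_l \|e_l\|,$$

where $w_l = \sum_{r} h_{rl}$, and $c_1$ and $d_1$ are positive constants in terms of the RICs.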

A detailed proof of the above theorem is given in Section V. We use some intermediate steps in the proof of Theorem 1 to address the convergence of DHTP. We show convergence by two alternative approaches in the following two theorems.

###### Theorem 2 (Convergence)

Let denote the magnitude of the ’th highest amplitude element of and . If and , then the DHTP algorithm converges after iterations, and its performance is bounded by

$$\|x - \hat{x}_{l,\bar{k}}\| \le d \, \|e\|_{\max},$$

where , , and are positive constants in terms of . Under the above conditions, estimated support sets across all nodes are equal to the correct support set, that means

$$\forall l, \quad \hat{\mathcal{T}}_l = \operatorname{supp}(x, s).$$

For an interpretation of the above theorem, we provide a numerical example. If then we have , ; for an appropriate such that , the performance is upper bounded by after iterations.

###### Corollary 1

Consider the special case of a doubly stochastic network matrix . Under the same conditions stated in Theorem 2, we have

$$\|\underline{x} - \underline{\hat{x}}_{\bar{k}}\| \le d \, \|\underline{e}\|,$$

where , and . This upper bound is tighter than the bound of Theorem 2.

###### Theorem 3 (Convergence)

If and , then after iterations, DHTP algorithm converges and its performance is bounded by

$$\|x - \hat{x}_{l,\bar{k}}\| \le d \, \|e\|_{\max},$$

where .

A relevant numerical example for interpretation of Theorem 3 is as follows: if and dB, then we have and . The proofs of the theorems and corollary are presented in Section V.

It can be seen that the DHTP algorithm has a convergence guarantee when . Note that the requirement on signal-to-noise relation in Theorem 3 is weaker than the requirement in Theorem 2. The above results can be readily extended to the noiseless case, that means . For the noiseless case, DHTP provides the exact estimate of the sparse signal at every node.

### II-A Similarities and differences between DHTP and DiHaT

The DiHaT algorithm of  is shown in Algorithm 2. Comparing with Algorithm 1, the similarities and differences between DHTP and DiHaT are given in the list below.

1. DiHaT requires exchange of intermediate estimates, observations and system matrices among nodes. DHTP requires exchange of intermediate estimates only.

2. DiHaT requires $H$ to be a doubly stochastic matrix. This is not a requirement for DHTP.

3. For theoretical convergence proof of DiHaT, an assumption is that the average noise over nodes . On the other hand, DHTP requires a signal-to-noise-ratio term .

4. Denoting , DiHaT converges if . On the other hand, DHTP converges if .

5. DiHaT provides consensus in the sense of achieving same estimation at all nodes under certain technical conditions. DHTP does not provide consensus except in the noiseless case under certain technical conditions.

6. DiHaT requires all $A_l$ to be of the same size. DHTP does not require this condition, which means that the dimension of $A_l$ can vary across nodes. This is an important advantage in practical scenarios.

## III Simulation Results

Fig. 1: mSENR performance of DHTP, DiHaT and HTP algorithms with respect to SNR. (a) Performance for a right stochastic network matrix. (b) Performance for a doubly stochastic network matrix.

Fig. 2: Probability of perfect support-set estimation (PSE) versus sparsity level. No-noise condition, using a doubly stochastic network matrix.

In this section, we study the practical performance of DHTP using simulations and compare it with DiHaT (Algorithm 2). We perform the study using Monte-Carlo simulations over many instances of $A_l$, $x$ and $e_l$ in the system model (1). The non-zero scalars of the sparse signal are i.i.d. Gaussian. This is referred to as a Gaussian sparse signal. We set the number of nodes $L = 20$, and every node in the network is randomly connected to three other nodes apart from itself. The stopping criterion for the algorithms is a maximum of 30 iterations. We used both right stochastic and doubly stochastic $H$ in simulations. Given an edge matrix of network connections between nodes, generating a right stochastic matrix is a simple task. The doubly stochastic matrix is generated through the second largest eigenvalue modulus (SLEM) optimization problem. Finally, we also show performance for real image data.
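The data and network generation described above can be sketched as follows. The function names are hypothetical, and the uniform link weights are an assumption: the text only states that a right stochastic matrix is generated from an edge matrix. The SLEM-based doubly stochastic construction is not covered by this sketch.

```python
import numpy as np

def gaussian_sparse_signal(N, s, rng):
    """Signal of length N with s non-zero i.i.d. Gaussian entries on a random support."""
    x = np.zeros(N)
    support = rng.choice(N, size=s, replace=False)
    x[support] = rng.standard_normal(s)
    return x

def random_edge_matrix(L, degree, rng):
    """Edge matrix where each node links to itself and (degree - 1) other random nodes."""
    E = np.eye(L)
    for l in range(L):
        others = np.delete(np.arange(L), l)
        E[l, rng.choice(others, size=degree - 1, replace=False)] = 1.0
    return E

def right_stochastic_network(L, degree, rng):
    """Right stochastic H with uniform weights over each node's links (assumed weighting)."""
    E = random_edge_matrix(L, degree, rng)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = gaussian_sparse_signal(N=500, s=20, rng=rng)
H = right_stochastic_network(L=20, degree=4, rng=rng)
```

With `degree=4`, each row of `H` has four non-zero weights: the node itself and three randomly chosen other nodes.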

### III-A Performance measures

For performance evaluation, we used a mean signal-to-estimation-noise ratio (mSENR) metric, . To generate noisy observations, we used Gaussian noise. The signal-to-noise ratio (SNR), is considered to be the same across all nodes.
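The formulas for mSENR and SNR did not survive in this text. The sketch below therefore assumes the common forms $\mathrm{mSENR} = \frac{1}{L}\sum_{l} \mathbb{E}\|x\|^2 / \mathbb{E}\|x - \hat{x}_l\|^2$ and $\mathrm{SNR} = \mathbb{E}\|x\|^2 / \mathbb{E}\|e_l\|^2$, with expectations replaced by averages over Monte-Carlo trials. Both definitions and the function names are assumptions, not confirmed by the source.

```python
import numpy as np

def add_noise(clean, snr_db, signal_energy, rng):
    """Add white Gaussian noise e with E||e||^2 = signal_energy / 10^(snr_db / 10)."""
    noise_energy = signal_energy / 10 ** (snr_db / 10)
    sigma = np.sqrt(noise_energy / clean.size)
    return clean + sigma * rng.standard_normal(clean.size)

def msenr_db(x_true, x_est):
    """Assumed mSENR in dB.

    x_true: array of shape (trials, N) with the true signals.
    x_est:  array of shape (trials, L, N) with the estimates at the L nodes.
    """
    signal_energy = np.mean(np.sum(x_true ** 2, axis=1))
    # Mean squared estimation error per node, averaged over trials.
    error_energy = np.mean(np.sum((x_est - x_true[:, None, :]) ** 2, axis=2), axis=0)
    return 10 * np.log10(np.mean(signal_energy / error_energy))
```

For example, if every node's estimate deviates from the true signal by 0.1 in each of four unit-valued components, the ratio of signal energy to error energy is 100 at every node, which corresponds to 20 dB.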

### III-B Experiments using Simulated Data

We use $A_l$ that all have the same row size. The same row size is necessary to use DiHaT for comparison. For the experiments, we set $M = 100$ and signal dimension $N = 500$. In our first experiment, we compare DHTP, DiHaT and HTP for right stochastic and doubly stochastic $H$. The matrices are shown in the appendix. We set sparsity level . The results are shown in Fig. 1 where we show mSENR versus SNR. We recall that DiHaT was not designed for right stochastic $H$ and that HTP is a standalone algorithm that does not use the network. For right stochastic $H$, we observe from Fig. 1 (a) that DiHaT does not provide considerable gain over HTP, but DHTP does. On the other hand, for doubly stochastic $H$, we observe from Fig. 1 (b) that DiHaT provides a considerable gain over HTP, but DHTP outperforms DiHaT. The experiment validates that DHTP works for right stochastic $H$. We ran experiments with many instances of $H$ and noted a similar trend in performance.

Next, we study the probability of perfect signal estimation under a no-noise condition. Under this condition, the probability of perfect signal estimation is equivalent to the probability of perfect support-set estimation (PSE) at all nodes. Keeping $M$ and $N$ fixed, we vary the sparsity level $s$ and compute the probability of PSE using the frequentist approach, that is, by counting how often PSE occurred. We used the same doubly stochastic $H$ as in the first experiment. The result is shown in Fig. 2. It can be seen that DHTP outperforms DiHaT in the sense of the phase transition from perfect to imperfect estimation.
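The frequentist PSE estimate can be sketched as the fraction of trials in which every node recovers the true support set. The function names and the trial format are assumptions for illustration; the sketch assumes each true signal has exactly $s$ non-zero entries.

```python
import numpy as np

def top_s_support(x, s):
    """Set of indices of the s largest-amplitude components of x."""
    return set(np.argsort(np.abs(x))[-s:].tolist())

def is_pse(x_true, x_est_nodes, s):
    """True if the estimated support at every node equals the true support."""
    true_support = set(np.flatnonzero(x_true).tolist())
    return all(top_s_support(x_l, s) == true_support for x_l in x_est_nodes)

def pse_probability(trials, s):
    """Fraction of trials with perfect support-set estimation at all nodes.

    trials: list of (x_true, x_est_nodes) pairs, one pair per Monte-Carlo run.
    """
    return float(np.mean([is_pse(x, est, s) for x, est in trials]))
```

A trial counts as a success only if all nodes succeed, so a single node with one wrong index makes that trial a failure.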

In the third experiment, we observe the convergence speed of the algorithms. Fast convergence leads to less usage of communication and computational resources, and less time delay in learning. We set . The results are shown in Fig. 3 where we show mSENR versus the number of iterations, for the noiseless condition and 30 dB SNR. We note that DHTP converges significantly faster. In our experiments, DHTP typically converged within five iterations.

Fig. 3: mSENR performance of DHTP and DiHaT with respect to the number of information exchanges (number of iterations).

Fig. 4: Sensitivity of DHTP and DiHaT to knowledge of the sparsity level. M = 100, N = 500, s = 20, L = 20, d = 4, SNR = 30 dB.

Finally, we examine the sensitivity of the DHTP and DiHaT algorithms to prior knowledge of the sparsity level. For this, we use 30 dB SNR and $s = 20$. Fig. 4 shows the results for different assumed that varies as , and . We observe that DHTP performs better than DiHaT for all assumed sparsity levels. Also, a typical trend is that assuming a higher sparsity level is better than assuming a lower one.

### III-C Experiments for real data

Fig. 5: Performance comparison of DHTP, DiHaT and HTP over image data. The first column, second column and the third column of the figure correspond to DHTP, DiHaT and HTP, respectively. The top row of the figure contains the original images. The second row contains the reconstructed images using DHTP with sparsity level 11%. The last row shows the PSNR performance of the algorithms with respect to a varying number of observations (M).

We evaluate the performance on three standard grayscale images: Peppers, Lena and Baboon of size pixels. We consider 11% of the highest magnitude DCT coefficients of an image to decide a sparsity level choice. In the DCT domain, the signal is split into equal parts (or blocks) for ease of computation. This leads to the value of for each part as close to . We reconstruct the original images using the DHTP, DiHaT and HTP algorithms over the doubly stochastic network matrix chosen in the previous subsection. The performance measure is the peak-signal-to-noise-ratio (PSNR), defined as , where denotes the norm. We show performance for a randomly chosen node among the set of 20 nodes. Fig. 5 shows a plot of the PSNR versus the number of observations at each node ($M$). In the same figure, we also show visual reconstruction quality (reconstructed image) at for DHTP. We observe that DHTP has a better convergence rate and PSNR performance than the other two algorithms.

### III-D Reproducible research

In the spirit of reproducible research, we provide relevant Matlab codes at www.ee.kth.se/reproducible/ and the link https://sites.google.com/site/saikatchatt/softwares. The code produces the results shown in the figures.

## IV Conclusion

For sparse learning over a network using distributed greedy algorithms such as the hard thresholding approach, we show that the strategy of exchanging signal estimates between nodes is good for learning. This has an explicit advantage of low communication overhead. We show that appropriate algorithmic strategies work for right stochastic network matrices. Use of right stochastic network matrices has higher generality than the popularly used doubly stochastic network matrices.

## V Details of Theoretical Proofs

### V-A Useful Lemmas

We provide three lemmas here that will be used in the proofs later. The first lemma provides a bound for the orthogonal projection operation used in DHTP.

###### Lemma 1

[38, Lemma 2] Consider the standard sparse representation model with . Let and . Define such that . If has RIC , then we have

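$$\|x - \bar{x}\| \le \sqrt{\frac{1}{1-\delta_{s_1+s_2}^2}} \, \|x_{\mathcal{S}^c}\| + \frac{\sqrt{1+\delta_{s_2}}}{1-\delta_{s_1+s_2}} \, \|e\|.$$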

The next lemma gives a useful inequality on squares of polynomials commonly encountered in the proofs.

###### Lemma 2

[38, Lemma 1] For non-negative numbers

The last lemma provides a bound for the energy content in the pruned indices.

###### Lemma 3

[11, Lemma 3] Consider two vectors and with , and . We have and . Let denote the set of indices of the smallest magnitude elements in . Then,

$$\|x_{\mathcal{S}_\nabla}\| \le \sqrt{2} \, \|(x - z)_{\mathcal{S}_2}\| \le \sqrt{2} \, \|x - z\|.$$

### V-B Proof of Theorem 1

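At Step 2, using Lemma 1, we have

$$\begin{aligned}
\|x - \tilde{x}_{l,k}\| &\le \sqrt{\frac{1}{1-\delta_{2s}^2}} \, \|x_{\tilde{\mathcal{T}}_{l,k}^c}\| + \frac{\sqrt{1+\delta_s}}{1-\delta_{2s}} \, \|e_l\| \\
&\overset{(a)}{=} \sqrt{\frac{1}{1-\delta_{2s}^2}} \, \|(x - \tilde{x}_{l,k})_{\tilde{\mathcal{T}}_{l,k}^c}\| + \frac{\sqrt{1+\delta_s}}{1-\delta_{2s}} \, \|e_l\|,
\end{aligned} \tag{2}$$

where (a) follows from the construction of $\tilde{x}_{l,k}$. Following the proof of [35, Theorem 3.8], we can write

$$\|(x - \tilde{x}_{l,k})_{\tilde{\mathcal{T}}_{l,k}^c}\| \le \sqrt{2} \, \delta_{3s} \|x - \hat{x}_{l,k-1}\| + \sqrt{2(1+\delta_{2s})} \, \|e_l\|. \tag{3}$$

Substituting the above inequality in (2), we have

$$\|x - \tilde{x}_{l,k}\| \le \sqrt{\frac{2\delta_{3s}^2}{1-\delta_{2s}^2}} \, \|x - \hat{x}_{l,k-1}\| + \frac{d_1}{2} \|e_l\|. \tag{4}$$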

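where $d_1$ collects the noise coefficients obtained from (2) and (3). Next, in Step 3, we get

$$\|x - \check{x}_{l,k}\| = \Big\|x - \sum_{r\in\mathcal{N}_l} h_{lr} \tilde{x}_{r,k}\Big\| \overset{(a)}{=} \Big\|\sum_{r\in\mathcal{N}_l} h_{lr} (x - \tilde{x}_{r,k})\Big\| \overset{(b)}{\le} \sum_{r\in\mathcal{N}_l} h_{lr} \|x - \tilde{x}_{r,k}\|, \tag{5}$$

where (a) follows as $\sum_{r\in\mathcal{N}_l} h_{lr} = 1$ and (b) follows from the fact that $h_{lr}$ is non-negative. Now, we bound the performance over the pruning step in Steps 4-5 as follows,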

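$$\|x - \hat{x}_{l,k}\| = \|(x - \check{x}_{l,k}) + (\check{x}_{l,k} - \hat{x}_{l,k})\| \overset{(a)}{\le} \|x - \check{x}_{l,k}\| + \|\check{x}_{l,k} - \hat{x}_{l,k}\| \overset{(b)}{\le} 2 \|x - \check{x}_{l,k}\|, \tag{6}$$

where (a) follows from the triangle inequality and (b) follows from the fact that $\hat{x}_{l,k}$ is the best $s$-size approximation to $\check{x}_{l,k}$. Combining (4), (5) and (6), we get

$$\|x - \hat{x}_{l,k}\| \le c_1 \sum_{r\in\mathcal{N}_l} h_{lr} \|x - \hat{x}_{r,k-1}\| + d_1 \sum_{r\in\mathcal{N}_l} h_{lr} \|e_r\|.$$

Summing the above inequality over all nodes $l$ and denoting $w_l = \sum_{r} h_{rl}$, we get the result of Theorem 1.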
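The pruning bound (6) can be checked numerically. The sketch below assumes that pruning keeps the $s$ largest-amplitude entries of the combined estimate and that the true signal is $s$-sparse, which is what makes the pruned vector at least as close to the combined estimate as the true signal is. The function names are hypothetical.

```python
import numpy as np

def hard_threshold(v, s):
    """Best s-term approximation of v: keep the s largest-amplitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out

def pruning_bound_holds(x, x_check, s, tol=1e-12):
    """Check ||x - x_hat|| <= 2 ||x - x_check|| for x_hat = hard_threshold(x_check, s)."""
    x_hat = hard_threshold(x_check, s)
    return np.linalg.norm(x - x_hat) <= 2 * np.linalg.norm(x - x_check) + tol
```

The factor 2 arises because $\|\check{x}_{l,k} - \hat{x}_{l,k}\| \le \|\check{x}_{l,k} - x\|$ whenever $x$ has at most $s$ non-zero entries.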

### V-C Proof of Theorem 2

Let be the permutation of indices of such that where for . In other words, is sorted in the descending order of magnitude. Assuming , we need to find the condition such that where .
First, we have the following corollary.

###### Corollary 2
$$\underline{\|x_{\tilde{\mathcal{T}}^c_{k}}\|} \le c_1 H \, \underline{\|x_{\tilde{\mathcal{T}}^c_{k-1}}\|} + (d_2 H + d_3 I_L) \, \underline{\|e\|},$$

where , , and .

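• The proof of the corollary follows from the following arguments. From (3), we can write

$$\|x_{\tilde{\mathcal{T}}^c_{l,k}}\| \le \sqrt{2} \, \delta_{3s} \|x - \hat{x}_{l,k-1}\| + \sqrt{2(1+\delta_{2s})} \, \|e_l\|,$$

due to the construction of $\tilde{x}_{l,k}$. Substituting (2), (5) and (6) at iteration $k-1$ in the above inequality, we get

$$\|x_{\tilde{\mathcal{T}}^c_{l,k}}\| \le c_1 \sum_{r\in\mathcal{N}_l} h_{lr} \|x_{\tilde{\mathcal{T}}^c_{r,k-1}}\| + d_2 \sum_{r\in\mathcal{N}_l} h_{lr} \|e_r\| + d_3 \|e_l\|.$$

The result follows from vectorizing the above inequality.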

Next, Lemma 4 derives the condition that , i.e., the desired indices are selected in Step 2 of DHTP. Note that we define .

###### Lemma 4

If and

$$x^*_{p+q} > c_1^{k_2-k_1} \|x^*_{\{p+1,\dots,s\}}\| + \frac{d_2+d_3}{1-c_1} \|e\|_{\max},$$

then, contains .

• The first part of the proof of this lemma is similar to the proof of [39, Lemma 3]. It is enough to prove that the highest magnitude indices of contains the indices for . Mathematically, we need to prove,

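$$\min_{j\in\{1,2,\dots,p+q\}} \big\|(\hat{x}_{l,k-1} + A_l^\top (y_l - A_l \hat{x}_{l,k-1}))_{\pi(j)}\big\| > \max_{d\in\mathcal{T}^c} \big\|(\hat{x}_{l,k-1} + A_l^\top (y_l - A_l \hat{x}_{l,k-1}))_{d}\big\|, \quad \forall l. \tag{7}$$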

The LHS of (7) can be written as

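$$\begin{aligned}
\big|(\hat{x}_{l,k-1} + A_l^\top (y_l - A_l \hat{x}_{l,k-1}))_{\pi(j)}\big| &\overset{(a)}{\ge} |x_{\pi(j)}| - \big|(-x + \hat{x}_{l,k-1} + A_l^\top (y_l - A_l \hat{x}_{l,k-1}))_{\pi(j)}\big| \\
&\ge x^*_{p+q} - \big|((A_l^\top A_l - I)(x - \hat{x}_{l,k-1}) + A_l^\top e_l)_{\pi(j)}\big|,
\end{aligned}$$

where (a) follows from the reverse triangle inequality. Similarly, the RHS of (7) can be written as

$$\big|(\hat{x}_{l,k-1} + A_l^\top (y_l - A_l \hat{x}_{l,k-1}))_{d}\big| = \big|x_d + (-x + \hat{x}_{l,k-1} + A_l^\top (y_l - A_l \hat{x}_{l,k-1}))_{d}\big| = \big|((A_l^\top A_l - I)(x - \hat{x}_{l,k-1}) + A_l^\top e_l)_{d}\big|.$$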

Using the bounds on LHS and RHS, (7) simplifies to

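$$x^*_{p+q} > \big|((A_l^\top A_l - I)(x - \hat{x}_{l,k-1}) + A_l^\top e_l)_{\pi(j)}\big| + \big|((A_l^\top A_l - I)(x - \hat{x}_{l,k-1}) + A_l^\top e_l)_{d}\big|.$$

Let the RHS of the sufficient condition at node $l$ be denoted as $\mathrm{RHS}_l$. Then,

$$\begin{aligned}
\mathrm{RHS}_l &\le \sqrt{2} \, \big\|((A_l^\top A_l - I)(x - \hat{x}_{l,k-1}) + A_l^\top e_l)_{\{\pi(j),d\}}\big\| \\
&\le \sqrt{2} \, \big\|((A_l^\top A_l - I)(x - \hat{x}_{l,k-1}))_{\{\pi(j),d\}}\big\| + \sqrt{2} \, \big\|(A_l^\top e_l)_{\{\pi(j),d\}}\big\| \\
&\overset{(a)}{\le} \sqrt{2} \, \delta_{3s} \|x - \hat{x}_{l,k-1}\| + \sqrt{2(1+\delta_{2s})} \, \|e_l\| \\
&\overset{(b)}{\le} c_1 \sum_{r\in\mathcal{N}_l} h_{lr} \|x_{\tilde{\mathcal{T}}^c_{r,k-1}}\| + d_2 \sum_{r\in\mathcal{N}_l} h_{lr} \|e_r\| + d_3 \|e_l\|,
\end{aligned}$$

where (a) follows from [38, Lemma 4-5] and (b) follows from substituting (2), (5) and (6). At iteration $k_2$, $\mathrm{RHS}_l$ can be vectorized as

$$\underline{\mathrm{RHS}} = [\mathrm{RHS}_1 \dots \mathrm{RHS}_L]^\top \le c_1 H \, \underline{\|x_{\tilde{\mathcal{T}}^c_{k_2-1}}\|} + (d_2 H + d_3 I_L) \, \underline{\|e\|}.$$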

Applying Corollary 2 repeatedly, we can write for ,

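$$\begin{aligned}
\underline{\mathrm{RHS}} &\le (c_1 H)^{k_2-k_1} \, \underline{\|x_{\tilde{\mathcal{T}}^c_{k_1}}\|} + (d_2 H + d_3 I_L)\big(I_L + \dots + (c_1 H)^{k_2-k_1-1}\big) \, \underline{\|e\|} \\
&\overset{(a)}{\le} c_1^{k_2-k_1} \, \underline{\|x^*_{\{p+1,\dots,s\}}\|} + \frac{d_2+d_3}{1-c_1} \, \underline{\|e\|}_{\max},
\end{aligned}$$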

where follows from the assumption that for any , and the right stochastic property of . Now, it can be seen that the bound in (7) is satisfied when

$$x^*_{p+q} > c_1^{k_2-k_1} \|x^*_{\{p+1,\dots,s\}}\| + \frac{d_2+d_3}{1-c_1} \|e\|_{\max}.$$

Next, we find the condition that in the following Lemma.

###### Lemma 5

If , and

$$x^*_{p+q} > c_3 c_1^{k_2-k_1-1} \|x^*_{\{p+1,\dots,s\}}\| + \left(\frac{c_3 (d_2+d_3)}{1-c_1} + d_4\right) \|e\|_{\max}$$

then, contains . The constant .

• It is enough to prove that the highest magnitude indices of contains the indices for . Mathematically, we need to prove,

$$\min_{j\in\{1,2,\dots,p+q\}} \|(\check{x}_{l,k})_{\pi(j)}\| > \max_{d\in\mathcal{T}^c} \|(\check{x}_{l,k})_{d}\|, \quad \forall l. \tag{8}$$

The LHS of (8) can be written as

where and follows from the fact that . Similarly, the RHS of (8) can be bounded as

Using the above two bounds, the condition (8) can now be written as

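$$x^*_{p+q} > \Big\|\sum_{r\in\mathcal{N}_l} h_{lr} \big(\tilde{x}_{\tilde{\mathcal{T}}_{r,k}} - x_{\tilde{\mathcal{T}}_{r,k}}\big)_{\pi(j)}\Big\| + \Big\|\sum_{r\in\mathcal{N}_l} h_{lr} \big(\tilde{x}_{\tilde{\mathcal{T}}_{r,k}} - x_{\tilde{\mathcal{T}}_{r,k}}\big)_{d}\Big\|. \tag{9}$$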

Define the RHS of the required condition from (9) at node as . Then, we can write the sufficient condition as

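$$\mathrm{RHS}_l \le \sqrt{2} \, \Big\|\sum_{r\in\mathcal{N}_l} h_{lr} \big(\tilde{x}_{\tilde{\mathcal{T}}_{r,k}} - x_{\tilde{\mathcal{T}}_{r,k}}\big)_{\{\pi(j),d\}}\Big\| \le \sqrt{2} \sum_{r\in\mathcal{N}_l} h_{lr} \|x - \tilde{x}_{r,k}\|.$$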

From the above equation and (4), we can bound as

where follows from substituting (V-B), (5) and (6). At iteration , can be vectorized as

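$$\underline{\mathrm{RHS}} = [\mathrm{RHS}_1 \dots \mathrm{RHS}_L]^\top \le c_3 H \, \underline{\|x_{\tilde{\mathcal{T}}^c_{k_2-1}}\|} + d_4 H \, \underline{\|e\|}.$$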

Applying Corollary 2 repeatedly, we can write for ,

where follows from the assumption that for any , and the right stochastic property of . From the above bound on , it can be easily seen that (9) is satisfied when

 x∗p+q>c