
# Diffusion Adaptation Framework for Compressive Sensing Reconstruction

Compressive sensing (CS) has drawn much attention in recent years due to its low sampling rate and high recovery accuracy. As an important procedure, reconstructing a sparse signal from few measurement data has been intensively studied. Many reconstruction algorithms have been proposed and have shown good reconstruction performance. However, when dealing with large-scale sparse signal reconstruction problems, the storage requirement is high, and many algorithms also suffer from high computational cost. In this paper, we propose a novel diffusion adaptation framework for CS reconstruction, where the reconstruction is performed in a distributed network. The measurement matrix is partitioned into small parts that are stored at the individual nodes, which distributes the storage load in a decentralized manner. Local information interaction provides the reconstruction ability. A simple and efficient gradient-descent based diffusion algorithm is then proposed to collaboratively recover the sparse signal over the network, and its convergence is analyzed. To further increase the convergence speed, a mini-batch based diffusion algorithm is also proposed. Simulation results show that the proposed algorithms achieve good reconstruction accuracy as well as fast convergence.

11/28/2019


## I Introduction

Compressive sensing (CS) is a novel sampling theory which provides a new signal sampling (encoding) and reconstruction (decoding) approach [1, 2, 3, 4, 5]. In detail, given a compressible signal f = Ψx, where Ψ is the transform basis and x is a sparse signal, the signal f can be measured by nonadaptive linear projections through a sensing matrix Φ ∈ R^{M×N}, i.e. y = Φf = ΦΨx = Θx, where y ∈ R^M is the measurement vector and Θ = ΦΨ is the measurement matrix. Then, at the decoder, x (or f) can be recovered from y using reconstruction algorithms. Since the CS framework provides a far lower sampling rate than Nyquist as well as high recovery accuracy, it has been widely used in many applications such as medical imaging [6] and radar imaging [7].
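As a concrete illustration of this sampling model, the following NumPy sketch draws a K-sparse signal, forms Gaussian measurements and produces y = ΦΨx = Θx; the sizes, the identity basis Ψ and the Gaussian Φ are illustrative choices only, not prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 64, 8                    # signal length, measurements, sparsity

x = np.zeros(N)                          # K-sparse coefficient vector
x[rng.choice(N, K, replace=False)] = rng.standard_normal(K)

Psi = np.eye(N)                          # transform basis (identity for brevity)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # Gaussian sensing matrix
Theta = Phi @ Psi                        # measurement matrix

y = Theta @ x                            # M-dimensional measurement vector
```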

As an important procedure in CS, recovering a sparse signal from an insufficient number of measurements has drawn much attention in recent years. In the last decade, many algorithms have been proposed that show accurate reconstruction performance [8, 9, 10, 11, 12, 13]. An important theoretical guarantee behind CS reconstruction is the restricted isometry property (RIP) [14]. It has been proved that if Θ obeys the RIP, the sparse signal can be recovered from a small number of measurements. It has also been shown that random matrices such as Gaussian and Bernoulli matrices satisfy the RIP condition with high probability [15, 16].

Although the CS reconstruction problem has been intensively studied, applying CS reconstruction to large-scale data (such as image data) remains challenging. In [17], the authors proposed a conjugate gradient orthogonal matching pursuit (CG-OMP) algorithm. CG-OMP utilizes the Structurally Random Matrix (SRM) [18] as the sensing matrix, which speeds up the signal recovery process and reduces the storage requirement. In particular, SRM is related to operator-based approaches, and can improve all greedy algorithms and several iterative shrinkage/thresholding (IST) methods such as the gradient projection for sparse reconstruction algorithm (GPSR) [19] and sparse reconstruction by separable approximation (SpaRSA) [20]. However, although the SRM projection can be computed quickly, the transform basis Ψ may not always admit a fast transform. Thus Ψ (or Θ) may still need to be stored, which requires large storage for large-scale data. To reduce the storage of Θ, a block-based compressive sensing (BCS) method was proposed [21, 22]. In BCS, the input signal is separated into several small blocks, and each block is individually sensed and recovered. BCS essentially utilizes a block-diagonal sensing matrix, which, however, lacks a theoretical guarantee. Moreover, BCS needs to modify the sampling strategy at the sensing stage, and its applicability is therefore limited.

An efficient way to solve the storage problem for large-scale data is to store the data in a distributed network and then solve the optimization problem in a distributed way. So far, several distributed optimization algorithms have been proposed, including distributed stochastic dual coordinate ascent (DisDCA) [23], communication-efficient distributed dual coordinate ascent (CoCoA) [24] and distributed stochastic variance reduced gradient (DSVRG) [25]. These algorithms focus on solving the optimization problem in parallel and need to collect the updated information from all nodes at each iteration. However, in many distributed networks, such as wireless sensor networks (WSNs), it is hard to gather information across the whole network at each iteration.

In this paper we propose a novel diffusion adaptation framework for CS reconstruction. The measurement matrix Θ is partitioned and stored in a decentralized manner, i.e. each node of the network stores only a small part of Θ. Therefore, the whole storage load is distributed over the nodes of the network. Further, inspired by diffusion adaptation strategies [26, 27, 28, 29, 30], a simple yet efficient diffusion ℓ0-LMS algorithm (Dℓ0-LMS) is proposed. Each node utilizes its finite number of data recursively, and the estimates are shared among neighbours at each iteration. Therefore, taking advantage of diffusion adaptation, the Dℓ0-LMS algorithm can collaboratively recover the sparse signal across the network. Utilizing an ℓ0-norm regularization term guarantees a sparse estimate. Information exchange within the network also provides fast convergence, which greatly increases computational efficiency. Moreover, a mini-batch based Dℓ0-LMS (MB-Dℓ0-LMS) is also proposed in this study. MB-Dℓ0-LMS utilizes the mini-batch gradient descent (MBGD) method and can further improve the convergence speed.

The proposed Dℓ0-LMS is a variant of the traditional sparse diffusion LMS algorithms [31, 32], which have shown the ability to learn sparse structure over distributed networks. In fact, the formulation of the Dℓ0-LMS algorithm for CS is similar to that of the traditional sparse diffusion algorithm except that the data are used recursively. However, applying the sparse diffusion algorithm to the specific CS task brings new features in both experimental results and theoretical analysis. In particular, the diffusion adaptation framework with a sparsity constraint makes it possible to collaboratively estimate the sparse signal, even though no node can individually reconstruct the signal from its insufficient number of data. Moreover, in CS all data have already been collected before processing; the finite, known data set simplifies the theoretical analysis and also makes mini-batch gradient estimation possible. Besides, since reconstruction speed is a critical issue in evaluating a CS reconstruction algorithm, a larger step size is preferred to achieve faster convergence. The small-step-size assumption used in the analysis of traditional diffusion algorithms [26, 27, 28, 29, 30] is therefore not suitable for CS, so in this paper we carry out a new theoretical analysis of the step-size condition for convergence without the small-step-size assumption.

The proposed Dℓ0-LMS is also related to ℓ0-LMS for CS [33]. ℓ0-LMS can be seen as a special case of Dℓ0-LMS in which the network contains only one node. By introducing the traditional sparse LMS algorithm to CS, ℓ0-LMS has shown great performance improvement compared with other algorithms. In particular, ℓ0-LMS demands less memory and achieves better reconstruction performance than other existing algorithms on large-scale CS reconstruction problems. In Dℓ0-LMS, each node performs the same weight-update process as ℓ0-LMS, so the computational complexity of each node at each iteration is the same as that of ℓ0-LMS. Moreover, the diffusion framework allows a much larger step size for convergence, and experiments confirm that the convergence speed is much faster than that of ℓ0-LMS. Further, simulations also show that Dℓ0-LMS achieves reconstruction accuracy similar to ℓ0-LMS.

One should also distinguish our work from distributed compressive sensing (DCS) [34, 35, 36, 37]. In DCS, a number of measurement data are recovered by a group of sensors. The measurement data at each node are assumed to be individually sparse in some basis and correlated from sensor to sensor. DCS aims to solve the jointly sparse ensemble reconstruction problem, which is not the topic of this work. Another related work is the distributed compressed estimation (DCE) scheme [38]. DCE incorporates compression and decompression modules into the distributed estimation procedure. The compressed estimator is estimated across the network using a diffusion adaptation strategy, and reconstruction algorithms are then employed to recover the sparse signal from the compressed estimator. In DCE, each node still needs to store the whole sensing matrix. Moreover, the reconstruction procedure is independent of the diffusion adaptation procedure, and may therefore suffer from the same problems as typical reconstruction methods on large-scale CS reconstruction problems.

The paper is organized as follows. In Section II we briefly review the concept of compressive sensing and propose the diffusion adaptation framework for CS reconstruction. The gradient-based and mini-batch based algorithms for diffusion CS reconstruction are proposed in Section III. The stability analysis of Dℓ0-LMS is carried out in Section IV. In Section V, simulation results are presented to verify the reconstruction performance. Finally, the conclusion is given in Section VI.

## II Diffusion Adaptation Framework for Compressive Sensing Reconstruction

Suppose a real-valued discrete signal f ∈ R^N is compressible, i.e. it can be represented as f = Ψx, where Ψ ∈ R^{N×N} is a transform basis matrix and x is a sparse signal with sparsity K. In the theory of CS, the sparse signal can be measured by

 ȳ = ΦΨx = Θx   (1)

where Φ ∈ R^{M×N} is the sensing matrix, Θ = ΦΨ is the measurement matrix and ȳ is the noise-free measurement vector (M ≪ N). In practice, the observed measurement vector may be noisy, and can be described as

 y = Θx + v   (2)

where v is the additive noise vector. The CS reconstruction task is to recover the sparse signal x from the measurement matrix Θ and the corresponding noisy measurement y. To successfully recover x, the measurement matrix Θ should obey the restricted isometry property (RIP).

In practice, the CS reconstruction problem can be viewed as a sparsity-constrained least-squares problem with the cost function

 J(x) = ∥y − Θx∥² + ξ∥x∥₀   (3)

where ∥x∥₀ is the regularization term and ξ is the regularization parameter.

To apply CS reconstruction in a decentralized manner, one can modify the above cost function. In particular, considering a connected network with P nodes (i.e. the network size is P), we can obtain the estimate of x by minimizing the following global cost function

 J_glob(x) = Σ_{k=1}^{P} ∥y_k − Θ_k x∥² + ξ∥x∥₀   (4)

where

 Θ = [Θ₁ᵀ, Θ₂ᵀ, …, Θ_Pᵀ]ᵀ,  y = [y₁ᵀ, y₂ᵀ, …, y_Pᵀ]ᵀ

with Θ_k ∈ R^{L_k×N}, y_k ∈ R^{L_k} and Σ_{k=1}^{P} L_k = M. Since Eq.(3) and Eq.(4) are essentially the same cost function, their optimal solutions coincide.
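Because the node blocks simply partition the rows of the least-squares data term, the summed local costs agree with the global one; a quick numerical check (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, P = 12, 8, 3
Theta = rng.standard_normal((M, N))
y = rng.standard_normal(M)
x = rng.standard_normal(N)

blocks = np.array_split(np.arange(M), P)          # rows held by each node
local = sum(np.sum((y[b] - Theta[b] @ x) ** 2) for b in blocks)
globl = np.sum((y - Theta @ x) ** 2)              # data term of Eq. (3)
```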

Fig. 1 shows an example of the diffusion adaptation framework for CS reconstruction. The connected network includes 7 nodes. The whole matrix Θ is partitioned into the small parts Θ₁, …, Θ₇. Node k stores only the small part Θ_k of the measurement matrix and receives the corresponding measurement data y_k. The information of a node can be transmitted within its neighbourhood (denoted by red links). Although each node has an insufficient number of measurements and can only exchange information with its local neighbours, information diffusion across the whole network provides access to the information of the entire Θ.

## III Proposed Algorithms for Compressive Sensing

### III-A Gradient-descent Dℓ0-LMS for CS

Diffusion adaptation algorithms for streaming data have been intensively studied [28, 32, 39, 40]. Given the temporal input data sequence {u_k(i)} and the corresponding output data sequence {d_k(i)} at node k, the sparse diffusion LMS algorithm [32] obtains the estimate by minimizing the following global cost function [32]

 J_glob(w) = Σ_{k=1}^{P} E[(d_k(i) − u_kᵀ(i)w)²] + ξ∥w∥₀   (5)

Intuitively, we can define {u_k(i)} and {d_k(i)} as

 {u_k(i)} = {θ_k(1), θ_k(2), …, θ_k(L_k)}
 {d_k(i)} = {y_k(1), y_k(2), …, y_k(L_k)}

where θ_k(j) and y_k(j) denote the transpose of the j-th row of Θ_k and the j-th scalar of y_k, respectively. Thus the solution of the CS reconstruction problem in Eq.(4) can be formulated based on the diffusion adaptation algorithm with the cost function in Eq.(5).

In the traditional diffusion algorithm, the data size is always much larger than the input dimension N. In CS, however, M is much smaller than N. When the sparse diffusion algorithm is applied to CS directly, the adaptation process may not converge to steady state due to the insufficient number of data. To solve this problem, we follow the method described in [33] and use the data recursively. In particular, the data used at the i-th iteration in node k are

 u_k(i) = θ_k(mod(i, L_k) + 1)
 d_k(i) = y_k(mod(i, L_k) + 1)   (6)
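A minimal sketch of this recursive (cyclic) data selection, using 0-based indexing in place of mod(i, L_k)+1:

```python
def node_data(theta_k, y_k, i):
    """Return the row/measurement pair that node k uses at iteration i,
    cycling through its L_k local samples as in Eq. (6)."""
    j = i % len(y_k)
    return theta_k[j], y_k[j]
```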

Therefore, combining the diffusion adaptation strategy with the modified data sequence in Eq.(6), we can derive two gradient-descent based diffusion adaptive algorithms for CS, namely the Adapt-then-Combine (ATC) diffusion ℓ0-LMS (ATC-Dℓ0-LMS) algorithm

 ψ_k(i+1) = w_k(i) + μ_k Σ_{l∈N_k} α_{l,k} ĝ_l(w_k(i)) − μ_k ξ ∇∥w_k(i)∥₀
 w_k(i+1) = Σ_{l∈N_k} β_{l,k} ψ_l(i+1)   (7)

and the Combine-then-Adapt (CTA) diffusion ℓ0-LMS (CTA-Dℓ0-LMS) algorithm

 φ_k(i) = Σ_{l∈N_k} β_{l,k} w_l(i)
 w_k(i+1) = φ_k(i) + μ_k Σ_{l∈N_k} α_{l,k} ĝ_l(φ_k(i)) − μ_k ξ ∇∥φ_k(i)∥₀   (8)

where ψ_k(i) and φ_k(i) are intermediate vectors of node k at time i, ∇∥·∥₀ is the derivative of ∥·∥₀, and

 ĝ_l(x_k(i)) = (d_l(i) − x_kᵀ(i)u_l(i)) u_l(i)   (9)

is the instantaneous gradient vector. N_k is the neighbourhood of node k and μ_k is the corresponding step size. α_{l,k} and β_{l,k} are non-negative weights assigned to the link between nodes l and k for the adaptation and combination steps, respectively. Further, α_{l,k} and β_{l,k} can be seen as the (l,k)-th entries of matrices S and A, respectively. Specifically, S and A should satisfy

 Sᵀ1 = Aᵀ1 = 1   (10)

and

 S(l,k) = A(l,k) = 0  if  l ∉ N_k   (11)

where 1 denotes the all-ones vector.
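One simple choice satisfying these constraints is uniform averaging over each node's neighbourhood; the helper below is an illustration, not the combination rule prescribed by the paper:

```python
import numpy as np

def uniform_weights(adjacency):
    """Column-stochastic weights: entry (l, k) is nonzero only for
    l in N_k, and each column sums to one as required by Eq. (10)."""
    A = np.asarray(adjacency, dtype=float)
    return A / A.sum(axis=0, keepdims=True)
```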

An important remaining problem is the calculation of ∇∥w∥₀. Since the ℓ0 norm is non-differentiable, an approximation should be used instead. There are several approximations of the ℓ0 norm [41] which work well for sparse identification [33, 32]. In this paper we use a zero-attraction term similar to that of ℓ0-LMS, which is defined as [33]

 −∇∥w(i)∥₀ ≈ z_δ(w(i)) = [z_δ(w₁(i)), z_δ(w₂(i)), …, z_δ(w_N(i))]ᵀ   (12)

where

 z_δ(w_m(i)) = { δ²w_m(i) + δ,  −1/δ ≤ w_m(i) < 0
                δ²w_m(i) − δ,  0 < w_m(i) ≤ 1/δ
                0,             otherwise   (13)

and δ is the zero-attraction parameter.
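The piecewise-linear zero attractor of Eq.(13) can be implemented directly; a vectorized sketch:

```python
import numpy as np

def zero_attract(w, delta):
    """Elementwise zero-attraction term z_delta of Eq. (13)."""
    z = np.zeros_like(w, dtype=float)
    neg = (w >= -1.0 / delta) & (w < 0)
    pos = (w > 0) & (w <= 1.0 / delta)
    z[neg] = delta**2 * w[neg] + delta    # pushes small negatives toward 0
    z[pos] = delta**2 * w[pos] - delta    # pushes small positives toward 0
    return z
```

Entries with magnitude above 1/δ are left untouched, so only near-zero coefficients are attracted to zero.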

The pseudo-code of ATC-Dℓ0-LMS is summarized in Algorithm 1. At iteration i, each node sends its data pair to its neighbours. Then the adaptation is performed at each node. After adaptation, the estimate at each node is transferred to its neighbours for combination. The process of CTA-Dℓ0-LMS is similar to that of ATC-Dℓ0-LMS except that the order of the adaptation and combination steps is reversed.
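To make the update concrete, the following self-contained sketch implements the ATC form of Eq.(7) with the recursive data reuse of Eq.(6). The uniform neighbourhood weights, parameter values and the final averaging across nodes are illustrative assumptions, not prescriptions of the paper:

```python
import numpy as np

def zero_attract(w, delta):
    """Zero-attraction approximation of -grad ||w||_0, Eqs. (12)-(13)."""
    z = np.zeros_like(w)
    neg = (w >= -1.0 / delta) & (w < 0)
    pos = (w > 0) & (w <= 1.0 / delta)
    z[neg] = delta**2 * w[neg] + delta
    z[pos] = delta**2 * w[pos] - delta
    return z

def atc_dl0_lms(Theta_parts, y_parts, adjacency, n_iter=3000,
                mu=0.2, xi=1e-5, delta=10.0):
    """Illustrative ATC-D(l0)-LMS: Theta_parts[k], y_parts[k] hold the
    rows of Theta and entries of y stored at node k; adjacency is a
    P x P 0/1 matrix with self-loops."""
    P = len(Theta_parts)
    N = Theta_parts[0].shape[1]
    # uniform neighbourhood averaging for both adaptation and combination
    C = np.asarray(adjacency, dtype=float)
    C = C / C.sum(axis=0, keepdims=True)
    W = np.zeros((P, N))                       # one estimate per node
    for i in range(n_iter):
        psi = np.empty_like(W)
        for k in range(P):
            grad = np.zeros(N)
            for l in np.nonzero(C[:, k])[0]:   # neighbours of node k
                j = i % len(y_parts[l])        # recursive data reuse, Eq. (6)
                u, d = Theta_parts[l][j], y_parts[l][j]
                grad += C[l, k] * (d - W[k] @ u) * u     # Eq. (9)
            # adaptation step of Eq. (7)
            psi[k] = W[k] + mu * grad + mu * xi * zero_attract(W[k], delta)
        W = C.T @ psi                          # combination step of Eq. (7)
    return W.mean(axis=0)
```

On a small synthetic problem each node holds only a few rows of Θ, yet the diffused estimates converge to the common sparse solution.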

Remark 1: Consider the special case where the network size P = 1. In this case, the reconstruction is performed in a stand-alone manner, and Eq.(7) and Eq.(8) both reduce to

 w(i+1) = w(i) + μ(d(i) − wᵀ(i)u(i))u(i) + μξ z_δ(w(i))   (14)

which is the typical ℓ0-LMS for CS [33]. Therefore, ℓ0-LMS can be viewed as a special case of ATC-Dℓ0-LMS and CTA-Dℓ0-LMS.

Remark 2: Strictly speaking, the name Dℓ0-LMS is nonstandard since the algorithm actually minimizes an approximation of the ℓ0 norm. In this paper, we use the name Dℓ0-LMS to stay consistent with ℓ0-LMS for CS [33].

Remark 3: For large-scale data, to reduce the amount of network transmission, one can omit the adaptation information exchange, i.e. set S = I [42]. That is, each node utilizes only its own data to perform adaptation, and then shares its estimate with its neighbours for combination. Simulation results in Section V will verify the feasibility of this strategy.

### Iii-B Mini-batch Dl0-LMS for CS

The proposed gradient-based Dℓ0-LMS is a typical extension of the traditional sparse diffusion algorithm. Note that, unlike traditional diffusion adaptation, in CS all the data are known in advance. Therefore, one can utilize more information at each iteration. In [33], the ℓ0-regularized exponentially forgetting window LMS (ℓ0-EFWLMS) algorithm was proposed to improve the convergence speed of ℓ0-LMS. Extended from the affine projection algorithm (APA) [43], ℓ0-EFWLMS utilizes a sliding-window approach that uses more data to improve the gradient estimate. However, ℓ0-EFWLMS still follows the traditional adaptive filtering method.

In data regression problems, the mini-batch gradient descent (MBGD) method has been widely used. MBGD selects a small part of the sample data, computes the gradient for each sample, and then takes the average as the gradient estimate. For diffusion CS, the input data for node k at each iteration can be chosen as

 U_k(i) = [θ_k(r(1)), θ_k(r(2)), …, θ_k(r(Q))]ᵀ
 D_k(i) = [y_k(r(1)), y_k(r(2)), …, y_k(r(Q))]ᵀ   (15)

where r is an index vector whose Q elements are uniformly and randomly chosen from {1, 2, …, L_k}. Then, according to the MBGD method, the average gradient is defined as

 Ĝ_l(x_k(i)) = (1/Q) U_lᵀ(i)(D_l(i) − U_l(i)x_k(i))   (16)

Then, the weight update of the corresponding ATC mini-batch diffusion ℓ0-LMS algorithm (ATC-MB-Dℓ0-LMS) can be represented as

 ψ_k(i+1) = w_k(i) + μ_k Σ_{l∈N_k} α_{l,k} Ĝ_l(w_k(i)) + μ_k ξ z_δ(w_k(i))
 w_k(i+1) = Σ_{l∈N_k} β_{l,k} ψ_l(i+1)   (17)

where the instantaneous gradient ĝ_l in Eq.(7) is replaced by Ĝ_l. The corresponding CTA mini-batch diffusion ℓ0-LMS algorithm (CTA-MB-Dℓ0-LMS) can be derived from Eq.(8) in the same way.
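A sketch of the mini-batch gradient of Eq.(16), treating the batch U as a Q × N matrix (our indexing convention, with 0-based indices):

```python
import numpy as np

def minibatch_grad(Theta_k, y_k, w, Q, rng):
    """Averaged mini-batch gradient of Eq. (16); the batch rows are
    drawn uniformly at random as in Eq. (15)."""
    idx = rng.integers(0, len(y_k), size=Q)   # index vector r
    U = Theta_k[idx]                          # Q x N input batch U_k(i)
    D = y_k[idx]                              # matching measurements D_k(i)
    return U.T @ (D - U @ w) / Q
```

At the noise-free solution the residual D − Uw vanishes, so the averaged gradient is zero, as expected of an unbiased gradient estimate.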

In real applications, the mini-batch method gives faster convergence than Dℓ0-LMS, but it may also cause instability when the step size is large. To alleviate the negative impact of the MBGD method, we optimize the iterative process by constraining the variation of the sparsity during convergence. In particular, the sparsity of the estimate w(i) at the i-th iteration is defined as

 s(w(i)) = Σ_{j=1}^{N} f(w_j(i))   (18)

where

 f(x) = { 1, |x| > τ
         0, |x| ≤ τ

and τ is a small positive threshold. Then, after a small number of iterations i₀, if the sparsity at the i-th iteration is larger than 1.5 times the sparsity at the i₀-th iteration, the estimate is not updated.
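The sparsity count and the 1.5x safeguard can be sketched as follows; the threshold value and function names are illustrative:

```python
import numpy as np

def sparsity(w, tau=1e-2):
    """s(w): number of entries whose magnitude exceeds tau, Eq. (18)."""
    return int(np.count_nonzero(np.abs(w) > tau))

def accept_update(w_new, s_ref, tau=1e-2):
    """Reject an update when the estimate densifies too quickly:
    the 1.5x factor is the safeguard described in the text."""
    return sparsity(w_new, tau) <= 1.5 * s_ref
```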

The pseudo-code of ATC-MB-Dℓ0-LMS is summarized in Algorithm 2. Unlike Algorithm 1, where the data pair is transmitted among neighbours, here the estimate w_k(i) is transmitted to the neighbours to alleviate the network transmission load. The gradient is computed at the neighbour nodes and then sent back. Similar to ATC-Dℓ0-LMS, one can also set S = I to omit the adaptation step and save network transmission.

### Iii-C Data Partition Strategy and Stop criterion

For a connected network, a diffusion-based algorithm allows each node to observe all the information in the data, so in theory the sparse signal can be recovered under any partition strategy. However, as discussed later in Part E of Section IV, the data correlation may influence the convergence condition on the step size, which is directly related to the reconstruction speed. Since the data are used recursively, correlated data may recur every L_k iterations. Therefore, to avoid high data correlation, the partition strategy should be selected so that L_k is as large as possible for each node. In practice, uniformly assigning the data to the nodes is a proper choice.

Although one can use a maximum iteration number to stop the iteration, a more practical stop criterion is desirable. In [33], the author utilizes the squared error between adjacent estimates as the stop condition. However, this is not always operational in real applications. Observing that the sparsity of the estimate remains constant as the algorithm converges to steady state, we propose a new stop criterion based on the sparsity of the estimate: given the window length L_s and the threshold p_s, define the count at iteration i

 S_c(i) = Σ_{j=i−L_s+1}^{i} p(s(w(j)))   (19)

where

 p(x) = { 1, s_min ≤ x ≤ s_min + p_s
         0, x > s_min + p_s

and s_min is the minimum observed sparsity; if S_c(i) = L_s, i.e. the sparsity has remained stable over the whole window, then the algorithm will stop.
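A sketch of this stop rule built on Eq.(19); taking s_min as the minimum over the current window and stopping once the whole window is stable (S_c(i) = L_s) are our assumptions where the text leaves details open:

```python
def should_stop(s_hist, Ls, ps):
    """Sparsity-based stop rule. s_hist holds s(w(j)) for j = 0..i;
    we stop once every sparsity value in the last Ls iterations lies
    in [s_min, s_min + ps], i.e. the count S_c(i) of Eq. (19) equals Ls."""
    if len(s_hist) < Ls:
        return False
    window = s_hist[-Ls:]
    s_min = min(window)
    Sc = sum(1 for s in window if s_min <= s <= s_min + ps)
    return Sc == Ls
```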

## IV Convergence Analysis

In this section, we carry out the convergence analysis of the Dℓ0-LMS algorithm for CS reconstruction. We first derive the recursion form of the algorithm, and then analyze the sufficient condition on the step size for convergence under different parameter settings. Finally, the influence of the regularization term and of the data reuse is discussed.

### IV-A Recursion form derivation

For simplicity, similar to [28, 32], we carry out the mean-square analysis on the following general diffusion structure

 φ_k(i) = Σ_{l∈N_k} β_{l,k} w_l(i)
 θ_k(i+1) = φ_k(i) + μ_k Σ_{l∈N_k} α_{l,k} ĝ_l(φ_k(i)) + μ_k ξ z_δ(φ_k(i))
 w_k(i+1) = Σ_{l∈N_k} γ_{l,k} θ_l(i+1)   (20)

where β_{l,k}, α_{l,k} and γ_{l,k} can be seen as the (l,k)-th entries of matrices A₁, S and A₂, respectively. ATC-Dℓ0-LMS and CTA-Dℓ0-LMS can be viewed as two special cases obtained by setting A₁ = I and A₂ = I, respectively.

Subtracting both sides of Eq.(20) from the desired sparse vector x, we obtain

 φ̃_k(i) = Σ_{l∈N_k} β_{l,k} w̃_l(i)
 θ̃_k(i+1) = φ̃_k(i) − μ_k Σ_{l∈N_k} α_{l,k} ĝ_l(φ_k(i)) − μ_k ξ z_δ(φ_k(i))
 w̃_k(i+1) = Σ_{l∈N_k} γ_{l,k} θ̃_l(i+1)   (21)

where w̃_k(i) = x − w_k(i), φ̃_k(i) = x − φ_k(i) and θ̃_k(i) = x − θ_k(i) are error vectors. Therefore, by defining

 w(i) = col{w_k(i)}_{k=1}^{P},  w̃(i) = col{w̃_k(i)}_{k=1}^{P}
 𝒜₁ = A₁ ⊗ I_N,  𝒜₂ = A₂ ⊗ I_N,  𝒮 = S ⊗ I_N
 D = diag{μ_k}_{k=1}^{P},  𝒟 = D ⊗ I_N
 V(i) = diag{v_k(i)}_{k=1}^{P},  𝒱(i) = V(i) ⊗ I_N
 ℋ(i) = diag{Σ_{l=1}^{P} α_{l,k} u_l(i)u_lᵀ(i)}_{k=1}^{P}
 s(i) = 𝒮ᵀ𝒱ᵀ(i) col{u_k(i)}_{k=1}^{P}

one can obtain the recursion from Eq.(21)

 w̃(i+1) = 𝒜₂ᵀ[I − 𝒟ℋ(i)]𝒜₁ᵀ w̃(i) − 𝒜₂ᵀ𝒟s(i) − ξ𝒜₂ᵀ𝒟z_δ(w(i))   (22)

### Iv-B Sufficient condition for convergence under deterministic measurement matrix

Based on the recursion in Eq.(22) and the fact that the data are used recursively, one can derive the following result.

###### Proposition 1

Given a certain measurement matrix Θ and matrices A₁, A₂ and S, for any finite initial vector and finite noise vector, define the product

 Γ = Π_{n=1}^{L_m} 𝒜₁ᵀ[I − 𝒟ℋ(n)]𝒜₂   (23)

where L_m is the least common multiple of {L_k}_{k=1}^{P}, and define the sequence {λ_j} as the eigenvalues of Γ arranged from large to small according to their moduli. For an arbitrary initial condition, w̃(i) in the diffusion algorithm of Eq.(22) will converge if the step sizes are selected so that the modulus of the (N−M+1)-th eigenvalue is less than 1, i.e.

 |λ_{N−M+1}| < 1   (24)
###### Proof 1

See Appendix A.

The above result can be used in practice since the generated measurement matrix is fixed during a specific CS task. Unlike the traditional diffusion process, once the data have been collected, the reconstruction process is deterministic and involves no further randomness.

### Iv-C Sufficient condition for convergence on random measurement matrices

In many situations, the measurement matrix may not be fixed over the whole reconstruction process. However, since the measurement matrix is usually chosen as a random matrix, we can follow the method of the traditional diffusion algorithms [26, 27, 28, 29, 30] and analyze the convergence of the proposed algorithm in the mean and mean-square sense.

Similar to traditional diffusion algorithms, for tractability of the analysis we use the following assumptions:
A1. The elements of the noise vector v are i.i.d. processes and independent of the measurement matrix Θ.
A2. L_k of each node is sufficiently large so that u_k(i) at an arbitrary node is independent of w_k(i).
A3. The noise has finite variance.

We remark that in CS reconstruction, due to the recursive use of the data, Assumption A2 may not be satisfied in practice. Nevertheless, if L_k is sufficiently large so that the correlation is sufficiently small, the independence assumption approximately holds. This has been verified by the simulation results of ℓ0-LMS for CS [33]. Moreover, note that the small-step-size assumption used in the analysis of traditional diffusion algorithms is removed here. In [26, 27, 28, 29, 30], the analysis is based on the assumption that the step sizes are sufficiently small so that higher-order powers of the step sizes can be ignored. However, in CS a larger step size is preferred to achieve faster convergence, so in this paper we eliminate this assumption and present a new analysis of the mean and mean-square performance. In the analysis, without loss of generality, the variances of the elements of v are all set to the same value.

Post-multiplying both sides of Eq.(22) by their respective transposes, taking expectations and applying the trace operation, we obtain the following weighted mean-square relation

 E[∥w̃(i+1)∥²_Σ] = E[∥w̃(i)∥²_Λ] + tr{ΣU(i)}   (25)

where

 U(i) = ξ²𝒜₂ᵀ𝒟K(i)𝒟𝒜₂ + 𝒜₂ᵀ𝒟𝒮ᵀP(i)𝒮𝒟𝒜₂
        − ξ𝒜₂ᵀ(I − (1/M)𝒟𝒢)𝒜₁ᵀJ(i)𝒟𝒜₂
        − ξ𝒜₂ᵀ𝒟Jᵀ(i)𝒜₁(I − (1/M)𝒢𝒟)𝒜₂   (26)

 Λ = 𝒜₁(I − (1/M)𝒢𝒟)𝒜₂ Σ 𝒜₂ᵀ(I − (1/M)𝒟𝒢)𝒜₁ᵀ
     + ((N+1)/M²) Σ_{k=1}^{P} 𝒜₁𝒯_k𝒟𝒜₂ Σ 𝒜₂ᵀ𝒟𝒯_k𝒜₁ᵀ   (27)

with

 G = diag{Σ_{l=1}^{P} α_{l,k}}_{k=1}^{P},  𝒢 = G ⊗ I_N
 T_k = diag{α_{l,k}}_{l=1}^{P},  𝒯_k = T_k ⊗ I_N
 J(i) = E[w̃(i)zᵀ(w(i))]
 K(i) = E[z(w(i))zᵀ(w(i))]
 P(i) = E[𝒱ᵀ(i)𝒱(i)]   (28)

Using the relations among the vectorization operator, the matrix trace and the Kronecker product

 vec{ABC} = (Cᵀ ⊗ A)vec{B}
 tr{AᵀB} = vec{B}ᵀvec{A}   (29)

and defining σ = vec{Σ}, Eq.(25) can be rewritten as

 E[∥w̃(i+1)∥²_σ] = E[∥w̃(i)∥²_{Fσ}] + vec{Uᵀ(i)}ᵀσ   (30)

where ∥·∥²_σ and ∥·∥²_Σ denote the same quantity, and

 F = (𝒜₁ ⊗ 𝒜₁)[(I − (1/M)𝒢𝒟) ⊗ (I − (1/M)𝒢𝒟)
     + ((N+1)/M²) Σ_{k=1}^{P} (𝒯_k𝒟 ⊗ 𝒯_k𝒟)](𝒜₂ ⊗ 𝒜₂)   (31)

It is easy to prove that K(i) and P(i) are always bounded. Moreover, for J(i) one can obtain

 |J(i)| = |E[w̃(i)zᵀ(w(i))]| ≤ δ|E[w̃(i)1ᵀ]|   (32)

Therefore, to ensure that J(i) is bounded, E[w̃(i)] should always be bounded. This leads to the mean recursion, obtained by taking the expectation of both sides of Eq.(22)

 E[w̃(i+1)] = Q E[w̃(i)] − ξ𝒜₂ᵀ𝒟E[z(w(i))]   (33)

where

 Q = 𝒜₂ᵀ(I − (1/M)𝒢𝒟)𝒜₁ᵀ   (34)

E[w̃(i)] will be bounded for all i if it converges as i → ∞. Moreover, it has been proven that the stability of F and Q ensures the convergence of Eq.(25) and Eq.(33), respectively [32]. Thus, for arbitrary ξ the sufficient condition for both mean and mean-square stability is

 ρ(F) < 1  and  ρ(Q) < 1   (35)

where ρ(·) denotes the spectral radius of a matrix.

To further simplify the condition in Eq.(35), we propose the following theorem:

###### Theorem 1

For an arbitrary sequence of real square matrices {B_k}_{k=1}^{t}, the following inequality always holds:

 ρ(B₁ ⊗ B₁) ≤ ρ(Σ_{k=1}^{t} (B_k ⊗ B_k))   (36)
###### Proof 2

See Appendix B.

According to Theorem 1, if we set B₁ = 𝒜₁(I − (1/M)𝒢𝒟)𝒜₂ = Qᵀ and B_{k+1} = (√(N+1)/M)𝒜₁𝒯_k𝒟𝒜₂ for k = 1, …, P, we can obtain

 ρ(Qᵀ ⊗ Qᵀ) ≤ ρ(F)   (37)

Moreover, according to the eigenvalue relationship of the Kronecker product, ρ(Qᵀ ⊗ Qᵀ) = ρ(Q)². Therefore, we have the following chain of conditions

 ρ(F) < 1 ⟹ ρ(Qᵀ ⊗ Qᵀ) < 1 ⟹ ρ(Q) < 1   (38)

Therefore, we can conclude that under Assumptions A1-A3, the condition ρ(F) < 1 guarantees both mean and mean-square stability of the proposed algorithm.

For large-scale data, computing the eigenvalues of F is hard since the complexity grows significantly as N increases. To simplify the computation of ρ(F), we further propose the following theorem:

###### Theorem 2

Given the sum of an arbitrary sequence of real square matrices

 B = Σ_{k=1}^{t} (B_k ⊗ B_k)   (39)

define

 B_I = Σ_{k=1}^{t} ((B_k ⊗ I_N) ⊗ (B_k ⊗ I_N))   (40)

where I_N is an identity matrix of arbitrary size N. Then we have

 ρ(B_I) = ρ(B)   (41)
###### Proof 3

See Appendix C.

Rewriting Eq.(31), we can obtain

 F = [A₁(I − (1/M)GD)A₂ ⊗ I_N] ⊗ [A₁(I − (1/M)GD)A₂ ⊗ I_N]
     + ((N+1)/M²) Σ_{k=1}^{P} [(A₁T_kDA₂ ⊗ I_N) ⊗ (A₁T_kDA₂ ⊗ I_N)]   (42)

Thus, according to Theorem 2, one can compute ρ(F) by simply setting I_N to 1. Specifically, by defining

 F̄ = (A₁ ⊗ A₁)[(I − (1/M)GD) ⊗ (I − (1/M)GD)
     + ((N+1)/M²) Σ_{k=1}^{P} (T_kD ⊗ T_kD)](A₂ ⊗ A₂)   (43)

we will have ρ(F̄) = ρ(F). Therefore, instead of calculating the spectral radius of F, which has (NP)² dimensions, we can simply obtain ρ(F) from F̄, which has only P² dimensions. In practice, F̄ is typically a sparse matrix, so a computationally efficient search technique [44, 45] can be used to find ρ(F̄).
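Since F̄ is small and sparse, its spectral radius can be estimated cheaply; the power-iteration helper below is a simple stand-in for the search techniques cited as [44, 45], and assumes the dominant eigenvalue is real and simple:

```python
import numpy as np

def spectral_radius(F, n_iter=1000, seed=0):
    """Estimate rho(F) by power iteration: repeatedly apply F to a
    random vector and renormalize; the Rayleigh quotient of the limit
    gives the dominant eigenvalue magnitude."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(F.shape[0])
    for _ in range(n_iter):
        v = F @ v
        v /= np.linalg.norm(v)
    return float(abs(v @ (F @ v)))
```

Checking ρ(F̄) < 1 for a candidate step size then certifies the sufficient condition of Eq.(35) under the stated assumptions.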

### Iv-D Further analysis under general parameter settings

Under Assumptions A1-A3, ρ(F) < 1 gives the sufficient condition for the convergence of the proposed algorithm. Moreover, in practical diffusion adaptation, the step sizes of all nodes are usually set to the same value, and the combination matrix is usually set to a doubly stochastic matrix [26, 27, 28, 29, 30, 46]. Therefore, in the following analysis we set D = μI, where μ is the identical step size for all nodes. A₂ is set doubly stochastic, so that A₂1 = A₂ᵀ1 = 1. Without loss of generality, we only analyze the ATC strategy here, so that A₁ = I. Thus Eq.(43) can be further simplified as

 F= [(1−μM)2I+N+1M2μ2M∑