# Convergence Analysis of a Cooperative Diffusion Gauss-Newton Strategy

In this paper, we investigate the convergence performance of a cooperative diffusion Gauss-Newton (GN) method. GN is widely used to solve nonlinear least squares (NLLS) problems because of its low computational cost compared with Newton's method. The diffusion GN method collects the diverse temporal-spatial information available over the network and uses it in local updates. To address the challenges of the convergence analysis, we first form a global recursion relation over the spatial and temporal scales, since the traditional GN is a time-iterative method and a network-wide NLLS problem needs to be solved. Second, the derived recursion, which relates the network-wide deviations at two successive iterations, is ambiguous because the discrepancy between the GN descent steps of the cooperative and non-cooperative versions is uncertain. Thus, an important task is to derive boundedness conditions for this discrepancy. Finally, based on the temporal-spatial recursion relation and the theory of steady-state equilibria for discrete dynamical systems, we obtain sufficient conditions for convergence of the algorithm, which require good initial guesses, reasonable step size values and network connectivity. This analysis provides a guideline for applications based on the diffusion GN method.


## I Introduction

The Gauss-Newton method has found wide application, for example in deep learning for artificial intelligence and neural networks [1, 2], and in parameter estimation in networked systems [3, 4, 5]. Derived from Newton's method, the GN algorithm discards the second-order terms in the computation of the Hessian for small-residual NLLS problems, thereby saving computation. The amount of computation can be reduced further by mathematical means. To simplify computing the first derivative of the objective function, the perturbed GN method was proposed in [6], where a perturbed version of the derivative replaces the original one. The truncated GN method [7] implements an inexact update instead of the exact one. The truncated-perturbed GN method [7] integrates both of these advantages into the update step.

Many scenarios can be modeled as NLLS problems whose solution depends on the performance of GN, such as computer vision [8], image alignment and reconstruction [9, 10], network-based localization [11, 12], signal processing for direction-of-arrival estimation and frequency estimation [13, 14], and power system state estimation [4, 15, 16]. Despite this widespread utility, it is difficult to use the original GN method as a fully cooperative scheme in a distributed network, since its iteration rule involves a matrix inverse, which is best suited to a centralized implementation. However, because of well-known advantages such as load balancing and robustness, a distributed algorithm with improved performance is preferred.

The purpose of this work is to analyze the convergence of a cooperative diffusion GN strategy over a distributed network, where every node senses temporal data that vary over the spatial domain. Several diffusion GN methods [17, 18] have been proposed for solving the localization problem in wireless sensor networks. However, they are centralized in nature and implemented in a non-cooperative way, in which the local intermediate estimates are not shared over the diffusion network.

Notation: The operator $(\cdot)^T$ denotes the transpose of a matrix or vector, and the operator $(\cdot)^{-1}$ denotes the inverse of a non-singular matrix. Capital letters denote matrices, while small letters denote vectors or scalars. The Euclidean norm of a vector is written as $\|\cdot\|$, and the 2-norm and Frobenius norm of a matrix are denoted by $\|\cdot\|$ and $\|\cdot\|_F$, respectively. $I$ and $\mathbf{1}$ denote the identity matrix and the vector whose every entry is 1, respectively. We use subscripts such as $k$ and $l$ to denote nodes, and superscripts such as $i$ and $j$ to denote time.

## II Description of cooperative diffusion Gauss-Newton solution

### II-A Centralized solution

For an adaptive network represented by a set $\mathcal{N}$ of $N$ nodes, we would like to estimate an unknown $M \times 1$ parameter vector $x$ belonging to a closed convex set. Let $f(x) = \mathrm{col}\{f_1(x), \dots, f_N(x)\}$ be a continuous and differentiable global cost function throughout the network, where $f_k(x)$ is the individual cost function associated with node $k$, built from the measurements that the node collects from the related events. The estimation problem can be formulated as

$$\min_x \|f(x)\|^2. \tag{1}$$

By rewriting $\|f(x)\|^2 = \sum_{k=1}^{N} |f_k(x)|^2$, the objective of each node in the network is to seek a vector $x$ that solves the following nonlinear least squares (NLLS) problem:

$$\min_x \sum_{k=1}^{N} |f_k(x)|^2. \tag{2}$$

The GN method is well recognized for solving NLLS problems. Let us consider a fusion center (FC) that can communicate with all nodes in the network. Given a good initial guess, a centralized scheme can be implemented on the FC based on the iterative GN update rule

$$x^{i+1} = x^{i} - \alpha^{i} d^{i}, \tag{3}$$

where $x^i$ is the estimate of $x$ at iteration $i$, $d^i$ denotes a GN descent direction, and $\alpha^i$ is the step size parameter that ensures $x^{i+1}$ is nearer to a stationary point than $x^i$.

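In this paper, we adopt the following assumption for the above optimization problem.

Assumption 1.

(1) The stationary points $x^s$ that satisfy

$$\nabla f(x^s) = 2F^T(x^s) f(x^s) = 0$$

always exist, where $F(x)$ is the Jacobian of $f(x)$ with size $N \times M$ and entries $\partial f_k(x)/\partial x_m$, $k = 1, \dots, N$, $m = 1, \dots, M$.

(2) The notations $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the minimum and maximum eigenvalues. For all $x$ in the closed convex set, let

$$\Sigma_{\min} = \min \sqrt{\lambda_{\min}\big(F^T(x)F(x)\big)}, \qquad \Sigma_{\max} = \max \sqrt{\lambda_{\max}\big(F^T(x)F(x)\big)},$$

where $0 < \Sigma_{\min} \le \Sigma_{\max}$.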

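Under Assumption 1, the approximate Hessian $F^T(x)F(x)$ is positive definite. Therefore, a local minimizer of (2), denoted by $x^*$, that belongs to the set of stationary points always exists [19, 20]. Thus, the descent direction of the GN update is written as

$$d^i = [F^T(x^i)F(x^i)]^{-1} F^T(x^i) f(x^i). \tag{4}$$

By rewriting

$$F(x) = \mathrm{col}\left\{ \frac{\partial f_1(x)}{\partial x}, \frac{\partial f_2(x)}{\partial x}, \cdots, \frac{\partial f_N(x)}{\partial x} \right\} \quad (N \times M) \tag{5}$$

and defining

$$F_k(x) \triangleq \frac{\partial f_k(x)}{\partial x} \quad (1 \times M), \tag{6}$$

we get

$$d^i = \left[ \sum_{k=1}^{N} F_k^T(x^i) F_k(x^i) \right]^{-1} \sum_{k=1}^{N} F_k^T(x^i) f_k(x^i). \tag{7}$$

Therefore, we have the following GN iteration update

$$x^{i+1} = x^i - \alpha^i \left[ \sum_{k=1}^{N} F_k^T(x^i) F_k(x^i) \right]^{-1} \sum_{k=1}^{N} F_k^T(x^i) f_k(x^i). \tag{8}$$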

To implement (8) successfully in a centralized way, we assume that the FC can communicate with all nodes over the network and that the same initial estimate is given. In the centralized GN algorithm, the quantities $F_k^T(x^i)F_k(x^i)$ and $F_k^T(x^i)f_k(x^i)$ computed by each node are aggregated by the FC to obtain the new estimate $x^{i+1}$ based on (8). The estimate is then returned to all nodes, and the process repeats until an appropriate termination condition is satisfied, for example when the decline in the norm falls below a predefined minimum or when a predefined maximum number of iterations is reached. Thus, the centralized GN actually includes a diffusion step in which the new estimate is sent from the FC to the individual nodes.

In this paper, we adopt a constant step size $\alpha$ for the subsequent development and analysis.
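The centralized iteration (8) with a constant step size can be sketched as follows. This is a minimal illustration only: the range-based localization problem, the anchor positions, the names `ANCHORS`, `RANGES` and `centralized_gn`, and the step-norm stopping rule are assumptions introduced here, not part of the paper. The measurements are noise-free, so the residual at the solution is zero.

```python
import math

# Hypothetical setup: N = 4 nodes at known 2-D positions each measure a
# noise-free distance to an unknown point, so M = 2 and the residual is zero.
ANCHORS = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
X_TRUE = (1.0, 2.0)
RANGES = [math.dist(a, X_TRUE) for a in ANCHORS]


def residual(k, x):
    """f_k(x): predicted minus measured distance at node k."""
    return math.dist(ANCHORS[k], x) - RANGES[k]


def jacobian_row(k, x):
    """F_k(x): the 1 x M gradient of f_k with respect to x."""
    d = math.dist(ANCHORS[k], x)
    return ((x[0] - ANCHORS[k][0]) / d, (x[1] - ANCHORS[k][1]) / d)


def solve_2x2(a, b):
    """Solve a d = b for a 2 x 2 matrix a by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return ((b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det)


def centralized_gn(x0, alpha=1.0, max_iter=50, tol=1e-12):
    """Iterate (8) at the fusion center until the step norm is below tol."""
    x = x0
    for _ in range(max_iter):
        # Each node contributes F_k^T F_k and F_k^T f_k; the FC sums them.
        h = [[0.0, 0.0], [0.0, 0.0]]
        g = [0.0, 0.0]
        for k in range(len(ANCHORS)):
            row, f = jacobian_row(k, x), residual(k, x)
            for r in range(2):
                g[r] += row[r] * f
                for c in range(2):
                    h[r][c] += row[r] * row[c]
        d = solve_2x2(h, g)  # descent direction (7)
        x = (x[0] - alpha * d[0], x[1] - alpha * d[1])
        if math.hypot(*d) < tol:
            break
    return x


estimate = centralized_gn((2.0, 2.0))
print(estimate)
```

The matrix inverse in (8) is replaced by a linear solve, which is the usual numerical choice; for larger $M$ a general solver would take the place of the hand-written 2 x 2 rule.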

### II-B Diffusion Gauss-Newton

Consider the adaptive network $\mathcal{N}$, where any node $k$ at time $i$ receives a set of estimates $x_l^i$, $l \in \mathcal{N}_k$, from all its 1-hop neighbors, including itself. The local estimates are then combined in a weighted way:

$$X_k^i = \sum_{l \in \mathcal{N}_k} c_{kl} x_l^i, \tag{9}$$

where $c_{kl}$ is the weighting coefficient between nodes $k$ and $l$, and the conditions

$$\sum_{l \in \mathcal{N}_k} c_{kl} = 1 \quad \text{and} \quad c_{kl} \in [0, 1] \ \text{for} \ l \in \mathcal{N}_k \tag{10}$$

are satisfied.
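As a small illustration of (9) and (10), the sketch below builds uniform combination weights and aggregates the neighbors' estimates. The ring topology, the uniform rule $c_{kl} = 1/|\mathcal{N}_k|$ and the function names are assumptions made for this example; the text does not prescribe a particular choice of weights.

```python
# Hypothetical 4-node ring; every neighborhood N_k includes node k itself.
NEIGHBORS = {0: [3, 0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3, 0]}


def uniform_weights(neighbors):
    """c_kl = 1/|N_k| for l in N_k, one simple choice that satisfies (10)."""
    return {k: {l: 1.0 / len(nbrs) for l in nbrs}
            for k, nbrs in neighbors.items()}


def aggregate(weights, estimates):
    """X_k = sum over l in N_k of c_kl * x_l, as in (9)."""
    dim = len(next(iter(estimates.values())))
    return {k: [sum(c * estimates[l][m] for l, c in row.items())
                for m in range(dim)]
            for k, row in weights.items()}


C = uniform_weights(NEIGHBORS)
x = {0: [1.0, 0.0], 1: [2.0, 3.0], 2: [3.0, 0.0], 3: [6.0, 3.0]}
X = aggregate(C, x)
print(X[0])  # average of the estimates held by nodes 3, 0 and 1
```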

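Once the aggregate estimate $X_k^i$ is obtained as the local weighted estimate, any node $k$ in the network can implement the GN update step as follows:

$$x_k^{i+1} = X_k^i - \alpha [Q_k^i(X)]^{-1} q_k^i(X), \tag{11}$$

where we define

$$Q_k^i(X) \triangleq F_{l \in \mathcal{N}_k}^T(X_l^i) F_{l \in \mathcal{N}_k}(X_l^i) \triangleq \sum_{l \in \mathcal{N}_k} F_l^T(X_l^i) F_l(X_l^i) \tag{12}$$

and

$$q_k^i(X) \triangleq F_{l \in \mathcal{N}_k}^T(X_l^i) f_{l \in \mathcal{N}_k}(X_l^i) \triangleq \sum_{l \in \mathcal{N}_k} F_l^T(X_l^i) f_l(X_l^i). \tag{13}$$

Removing the aggregation step from the diffusion GN algorithm, we obtain a non-cooperative diffusion GN algorithm, in which each node acts as an FC that implements the centralized GN by communicating with all its immediate neighbors. Its GN update step is given by

$$x_k^{i+1} = x_k^i - \alpha [Q_k^i(x_k^i)]^{-1} q_k^i(x_k^i), \tag{14}$$

where we define

$$Q_k^i(x_k^i) \triangleq F_{l \in \mathcal{N}_k}^T(x_k^i) F_{l \in \mathcal{N}_k}(x_k^i) \triangleq \sum_{l \in \mathcal{N}_k} F_l^T(x_k^i) F_l(x_k^i) \tag{15}$$

and

$$q_k^i(x_k^i) \triangleq F_{l \in \mathcal{N}_k}^T(x_k^i) f_{l \in \mathcal{N}_k}(x_k^i) \triangleq \sum_{l \in \mathcal{N}_k} F_l^T(x_k^i) f_l(x_k^i). \tag{16}$$

Note that the arguments in (12), (13), (15) and (16) show the main difference between the cooperative and non-cooperative algorithms: the cooperative terms are evaluated at the aggregates $X_l^i$, whereas the non-cooperative terms are evaluated at the node's own estimate $x_k^i$.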
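The two update rules (11) and (14) can be compared side by side in a short sketch. Everything specific here is assumed for illustration: a scalar parameter ($M = 1$, so the inverse becomes a division), the exponential measurement model, the 4-node ring with uniform weights, and the initial estimates. The only point of the sketch is where each term is evaluated.

```python
import math

# Hypothetical scalar problem (M = 1): node l observes y_l = exp(a_l * x_true),
# so f_l(x) = exp(a_l * x) - y_l and F_l(x) = a_l * exp(a_l * x).
A = [0.5, 1.0, 1.5, 2.0]
X_TRUE = 1.0
Y = [math.exp(a * X_TRUE) for a in A]
NEIGHBORS = {0: [3, 0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3, 0]}
C = {k: {l: 1.0 / len(n) for l in n} for k, n in NEIGHBORS.items()}


def f(l, x):
    return math.exp(A[l] * x) - Y[l]


def F(l, x):
    return A[l] * math.exp(A[l] * x)


def gn_direction(k, point_of):
    """Q^{-1} q over N_k; point_of(l) is where node l's terms are evaluated."""
    Q = sum(F(l, point_of(l)) ** 2 for l in NEIGHBORS[k])
    q = sum(F(l, point_of(l)) * f(l, point_of(l)) for l in NEIGHBORS[k])
    return q / Q


def diffusion_step(x, alpha):
    """Aggregate with (9), then update with (11)-(13)."""
    X = {k: sum(c * x[l] for l, c in C[k].items()) for k in x}
    return {k: X[k] - alpha * gn_direction(k, lambda l: X[l]) for k in x}


def noncooperative_step(x, alpha):
    """Update (14)-(16): every term is evaluated at the node's own x_k."""
    return {k: x[k] - alpha * gn_direction(k, lambda l, xk=x[k]: xk)
            for k in x}


x_coop = {0: 1.4, 1: 1.2, 2: 0.9, 3: 1.3}
x_solo = dict(x_coop)
for _ in range(100):
    x_coop = diffusion_step(x_coop, alpha=1.0)
    x_solo = noncooperative_step(x_solo, alpha=1.0)

print(max(abs(v - X_TRUE) for v in x_coop.values()))
print(max(abs(v - X_TRUE) for v in x_solo.values()))
```

Passing the evaluation point as a function keeps the two variants identical except for the argument, which mirrors the remark about (12), (13), (15) and (16).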

The remaining question is how well the diffusion GN algorithm performs in terms of its expected convergence behavior. First, what are the sufficient conditions for convergence of the diffusion GN algorithm? Second, does the diffusion GN algorithm converge better than its non-cooperative counterpart? In other words, what are the benefits of cooperation? The following analysis and simulations answer these questions.

## III Convergence analysis

### III-A Assumptions and data model

To proceed with the analysis, several reasonable assumptions are needed, as is commonly done in the literature [4, 21].

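Assumption 2.

(1) $f_{l \in \mathcal{N}_k}(x)$ is bounded for all $x$ near $x^*$, and satisfies

$$\|f_{l \in \mathcal{N}_k}(x_k^i)\| \le e_{\max}$$

and

$$\|f_{l \in \mathcal{N}_k}(x^*)\| = e_{\min},$$

where $e_{\min}$ denotes the minimum value of $\|f_{l \in \mathcal{N}_k}(x)\|$, attained when evaluated at $x^*$.

(2) For all $x$ and $k \in \mathcal{N}$, let

$$\sigma_{\min} = \min \sqrt{\lambda_{\min}\big(F_k^T(x) F_k(x)\big)}$$

and

$$\sigma_{\max} = \max \sqrt{\lambda_{\max}\big(F_k^T(x) F_k(x)\big)},$$

where $0 < \sigma_{\min} \le \sigma_{\max}$.

(3) Both $F_{l \in \mathcal{N}_k}(x)$ and $F_k(x)$ are Lipschitz continuous with Lipschitz constant $\omega$ such that

$$\|F_{l \in \mathcal{N}_k}(x) - F_{l \in \mathcal{N}_k}(y)\| \le \omega \|x - y\|$$

and

$$\|F_k(x) - F_k(y)\| \le \omega \|x - y\|$$

for all $x$ and $y$. Furthermore, we have the following results [22]

$$\|F_k^T(x) f_k(x) - F_k^T(y) f_k(y)\| \le \gamma_f \|x - y\|$$

and

$$\|F_k^T(x) F_k(x) - F_k^T(y) F_k(y)\| \le \gamma_F \|x - y\|,$$

where $\gamma_f$ and $\gamma_F$ are the corresponding Lipschitz constants.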

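In addition, the local convergence behavior needs to be studied from the global view of the network, since the performance of an individual node depends on the whole network, including the cooperation rule and the network topology. Thus, we introduce the global quantities

$$\begin{aligned}
x_G^i &\triangleq \mathrm{col}\{x_1^i, \dots, x_N^i\}, && (NM \times 1) \\
\bar{x}^* &\triangleq \mathrm{col}\{x^*, \dots, x^*\}, && (NM \times 1) \\
D_G^i &\triangleq \mathrm{col}\{D_1^i, \dots, D_N^i\}, && (NM \times 1) \\
d_G^i &\triangleq \mathrm{col}\{d_1^i, \dots, d_N^i\}, && (NM \times 1)
\end{aligned}$$

where

$$D_k^i \triangleq [Q_k^i(X)]^{-1} q_k^i(X), \quad k \in \mathcal{N},$$

and

$$d_k^i \triangleq [Q_k^i(x_k^i)]^{-1} q_k^i(x_k^i), \quad k \in \mathcal{N},$$

as well as

$$\begin{aligned}
A(x_G^i) &\triangleq \mathrm{diag}\{F_{l \in \mathcal{N}_1}(x_1^i), \dots, F_{l \in \mathcal{N}_N}(x_N^i)\}, && (NN \times NM) \\
A(\bar{x}^*) &\triangleq \mathrm{diag}\{F_{l \in \mathcal{N}_1}(x^*), \dots, F_{l \in \mathcal{N}_N}(x^*)\}, && (NN \times NM) \\
b(x_G^i) &\triangleq \mathrm{col}\{f_{l \in \mathcal{N}_1}(x_1^i), \dots, f_{l \in \mathcal{N}_N}(x_N^i)\}, && (NN \times 1) \\
b(\bar{x}^*) &\triangleq \mathrm{col}\{f_{l \in \mathcal{N}_1}(x^*), \dots, f_{l \in \mathcal{N}_N}(x^*)\}, && (NN \times 1)
\end{aligned}$$

where $\mathrm{diag}\{\cdot\}$ is a block diagonal matrix whose diagonal entries are those of the corresponding column vector $\mathrm{col}\{\cdot\}$.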

An aggregate matrix $C$ with non-negative real entries $c_{kl}$ can be given, redefined with the following conditions

$$c_{kl} = 0 \ \text{if} \ l \notin \mathcal{N}_k \quad \text{and} \quad \sum_{l=1}^{N} c_{kl} = 1, \ c_{kl} \ge 0. \tag{17}$$

Conditions (17) indicate that the entries on each row of the matrix $C$ sum to one, while the entry $c_{kl}$ of $C$ shows the degree of closeness between nodes $k$ and $l$. We will see the influence of the choice of $C$ on the performance of the resulting algorithms in later simulations.

Similarly, we introduce an adjacency matrix whose $(k, l)$ entry is 1 if node $k$ is linked with node $l$, and 0 otherwise.

We also introduce the extended aggregate matrix

$$G \triangleq C \otimes I_M, \quad (NM \times NM)$$

where $\otimes$ is the Kronecker product operation and $I_M$ is the $M \times M$ identity matrix.
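The extended aggregate matrix can be built and checked numerically with a few lines of code. The sketch below is illustrative only: the 3-node matrix `C`, the dimension `M = 2` and the helper names are assumptions. It forms $G = C \otimes I_M$, applies it to a stacked vector of node estimates so that each $M$-block of the result is one node's weighted aggregate, and checks two consequences of the row-sum condition in (17): $G$ leaves a stacked vector of identical blocks unchanged, and $\|G\|_F = \sqrt{M}\,\|C\|_F \le \sqrt{MN}$.

```python
import math


def kron_identity(C, M):
    """Return G = C (x) I_M as a nested list of size NM x NM."""
    N = len(C)
    G = [[0.0] * (N * M) for _ in range(N * M)]
    for k in range(N):
        for l in range(N):
            for m in range(M):
                G[k * M + m][l * M + m] = C[k][l]
    return G


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def frobenius(A):
    return math.sqrt(sum(a * a for row in A for a in row))


# Hypothetical 3-node network with M = 2; rows of C sum to one.
C = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
M = 2
G = kron_identity(C, M)

# Stacked estimates x_G = col{x_1, x_2, x_3}.
x_G = [1.0, 0.0, 2.0, 4.0, 3.0, 8.0]
X_G = matvec(G, x_G)        # stacked aggregates, one M-block per node
x_bar = [7.0, -1.0] * 3     # col{x*, x*, x*} for an arbitrary x*
print(X_G)
print(matvec(G, x_bar), frobenius(G), math.sqrt(M * len(C)))
```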

### III-B Temporal-spatial recursion relation

The temporal-spatial relation across the network needs to be considered as the starting point of the convergence analysis. First, the diffusion strategy leads to frequent spatial interaction between neighborhoods, so each node is influenced both by its local information and by the spatial information received from its neighbors. Second, because the method is iterative, the estimates and the locally collected information at each node are time-variant.

Starting from (9), we have

$$X_G^i = G x_G^i. \tag{18}$$

Using (18), we rewrite the local diffusion GN update step (11) in global form as

$$x_G^{i+1} = G x_G^i - \alpha D_G^i. \tag{19}$$

Accordingly, we get the global non-cooperative GN update step

$$x_G^{i+1} = x_G^i - \alpha d_G^i. \tag{20}$$

Subtracting $\bar{x}^*$ from both sides of (19) and embedding the right side of (20), we get

$$x_G^{i+1} - \bar{x}^* = (G x_G^i - x_G^i) + (x_G^i - \bar{x}^* - \alpha d_G^i) + \alpha (d_G^i - D_G^i). \tag{21}$$

Using the triangle inequality for vectors, we get the following recursion

$$\|x_G^{i+1} - \bar{x}^*\| \le \|G x_G^i - x_G^i\| + \|x_G^i - \bar{x}^* - \alpha d_G^i\| + \alpha \|D_G^i - d_G^i\|. \tag{22}$$

The inequality (22) can be regarded as a temporal-spatial recursion relation, where the superscript $i$ and the subscript $G$ reflect the evolution of the diffusion GN algorithm in the temporal and spatial dimensions, respectively. It also establishes the relation between the diffusion GN and non-cooperative diffusion algorithms from a global perspective.

For the first term on the right side of (22), we have

$$\begin{aligned}
\|G x_G^i - x_G^i\| &= \|G x_G^i - G \bar{x}^* + (\bar{x}^* - x_G^i)\| \\
&\le \|G x_G^i - G \bar{x}^*\| + \|x_G^i - \bar{x}^*\| \\
&\le \|G\|_F \|x_G^i - \bar{x}^*\| + \|x_G^i - \bar{x}^*\| \\
&= (\|G\|_F + 1) \|x_G^i - \bar{x}^*\|,
\end{aligned} \tag{23}$$

where we use $G \bar{x}^* = \bar{x}^*$, which follows from the row-sum property (17) of $C$.

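For the second term on the right side of (22), we have the following conclusion.

Lemma 1. Let Assumptions 1 and 2 hold. The norm of the global vector $x_G^i - \bar{x}^* - \alpha d_G^i$ satisfies the following recursion

$$\|x_G^i - \bar{x}^* - \alpha d_G^i\| \le t_1 \|x_G^i - \bar{x}^*\|^2 + t_2 \|x_G^i - \bar{x}^*\|, \tag{24}$$

where

$$t_1 \triangleq \frac{\alpha \omega}{2 \Sigma_{\min}}, \qquad t_2 \triangleq \frac{(1 - \alpha) \Sigma_{\max}}{\Sigma_{\min}} + \frac{\sqrt{2} N \alpha \omega e_{\min}}{\Sigma_{\min}^2}. \tag{25}$$

Proof: See Appendix A.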

Given (23) and (24), we rewrite the temporal-spatial recursion relation (22) as

$$\|x_G^{i+1} - \bar{x}^*\| \le t_1 \|x_G^i - \bar{x}^*\|^2 + (t_2 + \|G\|_F + 1) \|x_G^i - \bar{x}^*\| + \alpha \|D_G^i - d_G^i\|. \tag{26}$$

Here the left side of (26) is the network deviation at time $i+1$, while the right side of (26) is related to the network deviation at time $i$, provided we can confirm that $\|D_G^i - d_G^i\|$ shares the same character or is bounded by a given constant $\xi$. We can then establish the relation between the network deviations at two successive times in diffusion GN.

### III-C Boundedness of descent discrepancy

$D_G^i - d_G^i$ denotes the GN descent discrepancy over the network between the cooperative and non-cooperative modes. To establish the boundedness of the discrepancy, we first evaluate an entry of $D_G^i - d_G^i$, i.e., $D_k^i - d_k^i$.

To begin, we write the entry as

$$D_k^i - d_k^i = [Q_k^i(X)]^{-1} q_k^i(X) - [Q_k^i(x_k^i)]^{-1} q_k^i(x_k^i), \quad k \in \mathcal{N}. \tag{27}$$

Because of the matrix inverse operator, we introduce two quantities

$$S_k^i \triangleq Q_k^i(X) - Q_k^i(x_k^i), \qquad s_k^i \triangleq q_k^i(X) - q_k^i(x_k^i). \tag{28}$$

To reduce the impact of the inverse operator on our analysis, the well-known matrix expansion formula [21] will be used frequently. That is,

$$(Z + \delta Z)^{-1} = \sum_{u=0}^{\infty} (-1)^u (Z^{-1} \delta Z)^u Z^{-1} \tag{29}$$

for any non-singular matrix $Z$ and perturbation $\delta Z$ if $\|Z^{-1} \delta Z\| < 1$.
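The expansion (29) is easy to check numerically. The following sketch truncates the series for a 2 x 2 example and compares it with the directly computed inverse; the matrices `Z` and `dZ`, the helper names and the number of terms are assumptions chosen only so that $\|Z^{-1} \delta Z\|$ is well below one.

```python
def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


def neumann_inverse(Z, dZ, terms):
    """Truncate (29): sum over u < terms of (-1)^u (Z^{-1} dZ)^u Z^{-1}."""
    Z_inv = inverse_2x2(Z)
    ratio = matmul(Z_inv, dZ)           # Z^{-1} dZ, needs norm below 1
    power = [[1.0, 0.0], [0.0, 1.0]]    # (Z^{-1} dZ)^0
    total = [[0.0, 0.0], [0.0, 0.0]]
    for u in range(terms):
        term = matmul(power, Z_inv)
        sign = -1.0 if u % 2 else 1.0
        total = [[total[r][c] + sign * term[r][c] for c in range(2)]
                 for r in range(2)]
        power = matmul(power, ratio)
    return total


# Hypothetical values: Z plays the role of Q(x_k), dZ the role of S.
Z = [[4.0, 1.0], [1.0, 3.0]]
dZ = [[0.5, 0.2], [0.1, 0.3]]
exact = inverse_2x2([[Z[r][c] + dZ[r][c] for c in range(2)]
                     for r in range(2)])
approx = neumann_inverse(Z, dZ, terms=40)
error = max(abs(exact[r][c] - approx[r][c])
            for r in range(2) for c in range(2))
print(error)
```

The truncation error shrinks geometrically with the norm of $Z^{-1} \delta Z$, which is why a bound on that norm strictly below one is the condition needed before the series can be used.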

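From (9), $X_k^i$ is a convex combination of $x_l^i$ for $l \in \mathcal{N}_k$. Thus, Assumptions 1 and 2 hold for $X_k^i$.

Then we have

$$\|S_k^i\| = \left\| \sum_{l \in \mathcal{N}_k} \left[ F_l^T(X_l^i) F_l(X_l^i) - F_l^T(x_k^i) F_l(x_k^i) \right] \right\| \le \sum_{l \in \mathcal{N}_k} \gamma_F \|X_l^i - x_k^i\| \tag{30}$$

and

$$\|s_k^i\| = \left\| \sum_{l \in \mathcal{N}_k} \left[ F_l^T(X_l^i) f_l(X_l^i) - F_l^T(x_k^i) f_l(x_k^i) \right] \right\| \le \sum_{l \in \mathcal{N}_k} \gamma_f \|X_l^i - x_k^i\|. \tag{31}$$

From (30) and (31), both $\|S_k^i\|$ and $\|s_k^i\|$ depend on $\|X_l^i - x_k^i\|$. We now study the boundedness of $\|X_l^i - x_k^i\|$. Before that, we define a vector

$$c_l \triangleq \mathrm{row}\{c_{l1}, c_{l2}, \dots, c_{lN}\}, \quad l \in \mathcal{N},$$

which is the $l$-th row of the matrix $C$.

Evaluating the norm of $X_l^i - x_k^i$, we get

$$\begin{aligned}
\|X_l^i - x_k^i\| &= \|c_l x_G^i - c_l \mathbf{1}_N x_k^i\| \\
&\le \|c_l\| \, \|x_G^i - \mathbf{1}_N x_k^i\| \\
&\le \|x_G^i - \mathbf{1}_N x_k^i\|.
\end{aligned} \tag{32}$$

The block quantity $x_G^i - \mathbf{1}_N x_k^i$ represents the estimate difference across the network at time $i$ and is written as

$$x_G^i - \mathbf{1}_N x_k^i = \mathrm{col}\{x_1^i - x_k^i, x_2^i - x_k^i, \dots, x_N^i - x_k^i\},$$

whose individual entry is a vector.

For the norms of $x_l^i - x_k^i$ and $X_l^i - x_k^i$, and $[Q_k^i(x_k^i)]^{-1} S_k^i$, we have the following lemmas.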

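Lemma 2. Let Assumptions 1 and 2 hold. The estimate difference between nodes $k$ and $l$ through the non-cooperative GN update (14) is bounded by

$$\|x_l^i - x_k^i\| \le \Pi^i, \quad i \ge 1, \tag{33}$$

where

$$\Pi^i \triangleq a_2 \sum_{j=1}^{i} (a_1)^{j-1}, \tag{34}$$

$$a_1 \triangleq 1 + \frac{\alpha n_{kl} + 2 \alpha n_{kl} \gamma_f}{2 n_l \sigma_{\min}^2}, \tag{35}$$

$$a_2 \triangleq \frac{(n_l + 3 n_{k|l} + 3 n_{l|k}) \alpha \sigma_{\max} \varepsilon_{\max}}{2 n_l \sigma_{\min}^2}, \tag{36}$$

$n_{kl}$ denotes the number of nodes that are in both $\mathcal{N}_k$ and $\mathcal{N}_l$, and $n_{k|l}$ denotes the number of nodes that are in $\mathcal{N}_k$ and not in $\mathcal{N}_l$.

Proof: See Appendix B.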
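The quantities that enter Lemma 2 are simple to compute. The sketch below counts common and exclusive neighbors from two neighborhood sets and evaluates the geometric sum (34), which is equivalently the solution of the recursion $\Pi^{i+1} = a_1 \Pi^i + a_2$ with $\Pi^1 = a_2$. The neighborhoods and the numerical values of `a1` and `a2` are hypothetical placeholders, not values derived from (35) and (36).

```python
def common_and_exclusive(N_k, N_l):
    """Return (n_kl, n_k|l, n_l|k) for two neighborhoods given as sets."""
    return len(N_k & N_l), len(N_k - N_l), len(N_l - N_k)


def pi_bound(a1, a2, i):
    """Pi^i = a2 * sum_{j=1}^{i} a1^(j-1), the bound in (34)."""
    return a2 * sum(a1 ** (j - 1) for j in range(1, i + 1))


# Hypothetical neighborhoods and constants, chosen only for illustration.
counts = common_and_exclusive({0, 1, 2, 3}, {2, 3, 4})
a1, a2 = 1.05, 0.02

# (34) solves the recursion Pi^{i+1} = a1 * Pi^i + a2 with Pi^1 = a2.
value = a2
for i in range(1, 11):
    assert abs(value - pi_bound(a1, a2, i)) < 1e-12
    value = a1 * value + a2

print(counts, pi_bound(a1, a2, 10))
```

Because the sum is geometric, the bound grows with $i$ whenever $a_1 > 1$, so it is finite for every fixed $i$ but not uniformly bounded in that case.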

Lemma 3. Let Assumptions 1 and 2 hold. The estimate difference between nodes $k$ and $l$ through the diffusion GN update (11) is bounded by

$$\|X_l^i - x_k^i\| \le N \Pi^i, \quad i \ge 1, \tag{37}$$

and

$$\|[Q_k^i(x_k^i)]^{-1} S_k^i\| < 1 \tag{38}$$

always holds under the sufficient condition

$$n_{kl} > 0, \tag{39}$$

where $a_1$, $a_2$, $Q_k^i(x_k^i)$ and $S_k^i$ are given by (35), (36), (15) and (28), respectively.

Proof: See Appendix D.

Condition (39) means that any two nodes $k$ and $l$ in the network have at least one common neighboring node, which is more likely to be achieved in a small and dense network. However, the condition can be relaxed in practice by allowing all nodes to be linked over a single hop or multiple hops, so that it also holds for large-scale networks. Thus, it is reasonable to take the sufficient condition (38) for applying the expansion formula to (27) as holding under Lemma 3.
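Whether a given topology meets condition (39) can be checked directly from the neighborhood sets. The following sketch does this for ring networks of two sizes; the ring construction and the function names are assumptions made for illustration, and each neighborhood includes the node itself, as in the definition of the 1-hop neighbors.

```python
from itertools import combinations


def ring_neighborhoods(n):
    """Closed neighborhoods (self plus both ring neighbors) on an n-ring."""
    return {k: {(k - 1) % n, k, (k + 1) % n} for k in range(n)}


def satisfies_condition_39(neighborhoods):
    """True if every pair of nodes has n_kl > 0 common neighbors."""
    return all(neighborhoods[k] & neighborhoods[l]
               for k, l in combinations(neighborhoods, 2))


print(satisfies_condition_39(ring_neighborhoods(4)))  # True
print(satisfies_condition_39(ring_neighborhoods(6)))  # False
```

On the 6-node ring, opposite nodes have disjoint neighborhoods, which illustrates why the single-hop form of the condition favors small, dense networks.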

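Thus, we apply the expansion formula (29) and the norm operator to (27) as follows

$$\begin{aligned}
\|D_k^i - d_k^i\|
&= \left\| [Q_k^i(x_k^i) + S_k^i]^{-1} [q_k^i(x_k^i) + s_k^i] - [Q_k^i(x_k^i)]^{-1} q_k^i(x_k^i) \right\| \\
&= \left\| \left[ \sum_{u=0}^{\infty} (-1)^u \left( [Q_k^i(x_k^i)]^{-1} S_k^i \right)^u \right] [Q_k^i(x_k^i)]^{-1} [q_k^i(x_k^i) + s_k^i] - [Q_k^i(x_k^i)]^{-1} q_k^i(x_k^i) \right\| \\
&= \left\| \left[ \sum_{u=1}^{\infty} (-1)^u \left( [Q_k^i(x_k^i)]^{-1} S_k^i \right)^u \right] [Q_k^i(x_k^i)]^{-1} [q_k^i(x_k^i) + s_k^i] + [Q_k^i(x_k^i)]^{-1} s_k^i \right\| \\
&\le \left\| [Q_k^i(x_k^i)]^{-1} s_k^i \right\| + \left\| \sum_{u=1}^{\infty} \left( [Q_k^i(x_k^i)]^{-1} S_k^i \right)^u \right\| \left\| [Q_k^i(x_k^i)]^{-1} \right\| \left( \|q_k^i(x_k^i)\| + \|s_k^i\| \right) \\
&\le \frac{N \gamma_f \Pi^i}{\sigma_{\min}^2} + \frac{(\sigma_{\max} \varepsilon_{\max} + N \gamma_f \Pi^i) \zeta^i}{\sigma_{\min}^2 (1 - \zeta^i)},
\end{aligned} \tag{40}$$

where the last inequality comes from the obtained results (105), (107), (112) and (113) and from the definitions (34) and (114) of $\Pi^i$ and $\zeta^i$ (see Appendices C and D). From (96), we know that is a bounded quantity that depends on the network topology.

Finally, we obtain the boundedness conclusion as follows

$$\|D_G^i - d_G^i\| \le N \|D_k^i - d_k^i\| \le \frac{N^2 \gamma_f \Pi^i}{\sigma_{\min}^2} + \frac{(N \sigma_{\max} \varepsilon_{\max} + N^2 \gamma_f \Pi^i) \zeta^i}{\sigma_{\min}^2 (1 - \zeta^i)} \triangleq \xi. \tag{41}$$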

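### III-D Sufficient conditions for system convergence

Given the constant $\xi$ that satisfies (41), we rewrite the global recursion relation (26) as

$$\|x_G^{i+1} - \bar{x}^*\| \le t_1 \|x_G^i - \bar{x}^*\|^2 + (t_2 + \|G\|_F + 1) \|x_G^i - \bar{x}^*\| + \alpha \xi, \tag{42}$$

which can be regarded as a nonlinear discrete dynamical system. Letting $y^i = \|x_G^i - \bar{x}^*\|$, we simplify the notation of (42) to the general form

$$y^{i+1} \le t_1 (y^i)^2 + (t_2 + \|G\|_F + 1) y^i + \alpha \xi, \tag{43}$$

whose steady-state equilibrium is a level $y$ [23] that solves

$$y = \phi(y) = t_1 y^2 + (t_2 + \|G\|_F + 1) y + \alpha \xi. \tag{44}$$

With this expression it is easy to see that the global error is determined by the dynamical system, since $y^i = \|x_G^i - \bar{x}^*\|$. Thus, the stability of the system needs to be guaranteed.

Solving (44), we get two steady-state equilibrium points as follows

$$y_{\max} = \frac{-t_2 - \|G\|_F + \sqrt{(t_2 + \|G\|_F)^2 - 4 t_1 \alpha \xi}}{2 t_1} \tag{45}$$

and

$$y_{\min} = \frac{-t_2 - \|G\|_F - \sqrt{(t_2 + \|G\|_F)^2 - 4 t_1 \alpha \xi}}{2 t_1} \tag{46}$$

with the condition

$$(t_2 + \|G\|_F)^2 - 4 t_1 \alpha \xi \ge 0. \tag{47}$$

An equilibrium point of the dynamical system (43) is stable if and only if [23]

$$\left| \frac{d\phi(y)}{dy} \right| < 1, \tag{48}$$

where $d\phi(y)/dy$ is the first-order derivative of $\phi(y)$.

Thus, we know that $y_{\max}$ is unstable since

$$\left| \frac{d\phi(y)}{dy} \bigg|_{y_{\max}} \right| = \left| 1 + \sqrt{(t_2 + \|G\|_F)^2 - 4 t_1 \alpha \xi} \right| > 1, \tag{49}$$

while $y_{\min}$ can be stable if

$$\left| \frac{d\phi(y)}{dy} \bigg|_{y_{\min}} \right| = \left| 1 - \sqrt{(t_2 + \|G\|_F)^2 - 4 t_1 \alpha \xi} \right| < 1 \tag{50}$$

holds.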
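The stability test (48) to (50) can be tried out on the scalar map $\phi$ itself. In the sketch below, `p` stands for $t_2 + \|G\|_F$ and `c` for $\alpha \xi$; the numerical coefficients are arbitrary values picked only to make the discriminant in (47) positive, and they are not derived from any network. The map is iterated with equality, which corresponds to the worst case of (43).

```python
import math


def phi(y, t1, p, c):
    """Right side of (44) with p = t2 + ||G||_F and c = alpha * xi."""
    return t1 * y * y + (p + 1.0) * y + c


def equilibria(t1, p, c):
    """Roots (45)-(46) of y = phi(y); requires (47): p^2 - 4 t1 c >= 0."""
    root = math.sqrt(p * p - 4.0 * t1 * c)
    return (-p + root) / (2.0 * t1), (-p - root) / (2.0 * t1)


def slope(y, t1, p):
    """First derivative of phi at y, used in the stability test (48)."""
    return 2.0 * t1 * y + p + 1.0


# Arbitrary illustrative coefficients, not derived from any real network.
t1, p, c = 0.5, 1.5, 0.5
y_max, y_min = equilibria(t1, p, c)
print(abs(slope(y_max, t1, p)) < 1.0)  # stability test at y_max
print(abs(slope(y_min, t1, p)) < 1.0)  # stability test at y_min

# Iterate the map with equality from a point near y_min.
y = y_min + 0.3
for _ in range(50):
    y = phi(y, t1, p, c)
print(abs(y - y_min))
```

The slopes evaluated at the two roots reduce to one plus or minus the square root of the discriminant, which is exactly the form of (49) and (50).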

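Because

$$\|G\|_F = \sqrt{\sum_{k=1}^{N} \sum_{l=1}^{N} M (c_{kl})^2} \le \sqrt{M \sum_{k=1}^{N} \left( \sum_{l=1}^{N} c_{kl} \right)^2} = \sqrt{MN}, \tag{51}$$

from (50) we get the following constraints

$$2 \sqrt{t_1 \alpha \xi} - t_2 < \|G\|_F$$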

and

 max{(t2+∥G∥F)2