## I Introduction

This work considers stochastic optimization problems where a collection of networked agents work cooperatively to solve an aggregate optimization problem of the form

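$$
w^\star = \arg\min_{w}\ \sum_{k=1}^{N} J_k(w), \quad \text{where } J_k(w) = \mathbb{E}\, Q(w; x_k). \tag{1}
$$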

The local risk function $J_k(w)$ held by agent $k$ is differentiable and strongly convex, and it is constructed as the expectation of some loss function $Q(w; x_k)$. The random variable $x_k$ represents the streaming data received by agent $k$, and the expectation in $J_k(w)$ is over the distribution of $x_k$. While the cost functions may have different local minimizers, all agents seek to determine the common global solution $w^\star$ under the constraint that agents can only communicate with their direct neighbors. Problem (1) can find applications in a wide range of areas including wireless sensor networks [5, 6], distributed adaptation and estimation [7, 8, 9], and distributed statistical learning [10].

There are several techniques that can be used to solve problems of the type (1), such as consensus [11, 12, 13] and diffusion [9, 7, 8] strategies. However, the class of diffusion strategies has been shown to be particularly well-suited due to their enhanced stability range over other methods, as well as their ability to track drifts in the underlying models and statistics. We therefore focus on this class of algorithms since we are mainly interested in methods that are able to learn and adapt from continuous streaming data. For example, the adapt-then-combine formulation of diffusion takes the following form:

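$$
\begin{aligned}
\psi_{k,i} &= w_{k,i-1} - \mu \nabla Q(w_{k,i-1}; x_{k,i}), \qquad &(2)\\
w_{k,i} &= \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \psi_{\ell,i}, \qquad &(3)
\end{aligned}
$$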

where the subscript $k$ denotes the agent index and $i$ denotes the iteration index. The variable $x_{k,i}$ is the data realization observed by agent $k$ at iteration $i$. The scalar $a_{\ell k}$ is the weight used by agent $k$ to scale information received from agent $\ell$, and $\mathcal{N}_k$ is the set of neighbors of agent $k$ (including itself). In (2)–(3), variable $\psi_{k,i}$ is an intermediate estimate for $w^\star$ at agent $k$, while $w_{k,i}$ is the updated estimate. Note that step (2) uses the gradient of the loss function, $Q(\cdot)$, rather than its expected value $J_k(\cdot)$. This is because the statistical properties of the data are not known beforehand. If $J_k(w)$ were known, then we could use its gradient vector instead in (2). In that case, we would refer to the resulting method as a deterministic rather than stochastic solution. Throughout this paper, we employ a constant step-size $\mu$ to enable continuous adaptation and learning in response to drifts in the location of the global minimizer due to changes in the statistical properties of the data. The adaptation and tracking abilities are crucial in many applications; see examples in [8].

Previous studies have shown that both consensus and diffusion methods are able to solve problems of the type (1) well for sufficiently small step-sizes. That is, the squared error $\mathbb{E}\|\tilde{w}_{k,i}\|^2$ approaches a small neighborhood around zero for all agents, where $\tilde{w}_{k,i} = w^\star - w_{k,i}$. Note that these methods do not converge to the exact minimizer $w^\star$ of (1) but rather approach a small neighborhood around $w^\star$ with a small steady-state bias under both stochastic and deterministic optimization scenarios. For example, in deterministic settings where the individual costs $J_k(w)$ are known, it is shown in [14, 8] that the squared errors generated by the diffusion iterates converge to an $O(\mu^2)$-neighborhood. Note that, in this case, this inherent limiting bias is not due to any gradient noise arising from stochastic approximations; it is instead due to the inherent update structure in diffusion and consensus implementations (see the explanations in Sec. III.B in [3]). For stochastic optimization problems, on the other hand, the size of the bias is $O(\mu)$ rather than $O(\mu^2)$ because of the gradient noise.
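To make the adapt-then-combine steps (2)–(3) and the deterministic bias discussed above concrete, the following minimal sketch runs diffusion with exact (noise-free) gradients on a toy problem. Everything specific here is an assumption chosen for illustration and is not taken from the paper: the quadratic local risks $J_k(w) = \frac{h_k}{2}\|w - c_k\|^2$, the ring topology with uniform weights $1/3$, the step-sizes, and the helper names `ring_weights`, `toy_problem`, `diffusion_atc` and `steady_state_msd`.

```python
import numpy as np

def ring_weights(n):
    """Symmetric, doubly stochastic weights for a ring of n >= 3 agents
    (assumed choice: weight 1/3 on the agent itself and on each neighbor)."""
    A = np.zeros((n, n))
    for k in range(n):
        for l in (k - 1, k, k + 1):
            A[k, l % n] = 1.0 / 3.0
    return A

def toy_problem(n=10, dim=2, seed=0):
    """Hypothetical local risks J_k(w) = 0.5 * h_k * ||w - c_k||^2."""
    rng = np.random.default_rng(seed)
    h = rng.uniform(1.0, 2.0, size=n)          # per-agent curvature
    C = rng.standard_normal((n, dim))          # per-agent local minimizers
    w_star = (h[:, None] * C).sum(axis=0) / h.sum()  # minimizer of sum_k J_k
    grad = lambda W: h[:, None] * (W - C)      # row k holds grad J_k(w_k)
    return grad, w_star

def diffusion_atc(A, grad, W0, mu, iters):
    """Adapt-then-combine diffusion; row k of W is agent k's iterate."""
    W = W0.copy()
    for _ in range(iters):
        Psi = W - mu * grad(W)   # adaptation step, cf. (2)
        W = A.T @ Psi            # combination step, cf. (3)
    return W

def steady_state_msd(mu, n=10, dim=2, iters=5000):
    """Average squared distance to w_star after many iterations."""
    grad, w_star = toy_problem(n, dim)
    W = diffusion_atc(ring_weights(n), grad, np.zeros((n, dim)), mu, iters)
    return np.mean(np.sum((W - w_star) ** 2, axis=1))

msd_big = steady_state_msd(mu=0.1)
msd_small = steady_state_msd(mu=0.05)
print("steady-state MSD, mu=0.10:", msd_big)
print("steady-state MSD, mu=0.05:", msd_small)
```

In this sketch the steady-state error should stay away from zero even though no gradient noise is present, and it should shrink when the step-size is halved, which is the step-size-dependent bias described in the text.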

When high precision is desired, especially in deterministic optimization problems, it would be preferable to remove the inherent bias. Motivated by these considerations, the works [3, 4] showed that a simple correction step inserted between the adaptation and combination steps (2) and (3) is sufficient to ensure exact convergence of the algorithm to $w^\star$ by all agents (see expression (10) further ahead). In this way, the inherent bias is removed completely, and the convergence rate is also improved.

While the correction of the inherent bias is critical in the deterministic setting, it is not clear whether it can help in the stochastic and adaptive settings. This motivates us to study exact diffusion in these settings and compare it against standard diffusion. To this end, we carry out a higher-order analysis of the error dynamics for both methods, and derive their steady-state performance in terms of both the $O(\mu)$ and $O(\mu^2)$ terms. In contrast, traditional analysis for diffusion focuses only on the $O(\mu)$ term [7, 8]. Our analysis reveals that the inherent bias in diffusion can be significantly amplified by badly-connected graph topologies, and that the bias-correction step in exact diffusion can help address any potential deterioration that occurs in these scenarios.

### I-A Our Results

In particular, we will prove in Theorem 1, that, under sufficiently small step-sizes, the exact diffusion will converge exponentially fast, at a rate , to a neighborhood around . Moreover, the size of the neighborhood will be characterized as

(4) |

where the subscript $e$ indicates that the error is generated by the exact diffusion method, the quantity $\sigma^2$ is a measure of the size of gradient noise, $\lambda$ is the second largest magnitude of the eigenvalues of the combination matrix $A$, which reflects the level of the network connectivity, and $\nu$ is the strong convexity constant. In comparison, we will show that the traditional diffusion strategy converges at a similar rate, albeit to the following neighborhood:

(5) |

where the subscript $d$ indicates that the error is generated by the diffusion method, and $b^2$ is a bias constant independent of the gradient noise.

Expressions (4) and (5) have the following important implications. First, diffusion suffers from an inherent bias term that is independent of the gradient noise. In contrast, exact diffusion removes this bias. In fact, in the deterministic setting where the gradient noise is zero, it is observed from (4) and (5) that diffusion converges to an $O(\mu^2)$-neighborhood around the global solution while exact diffusion converges exactly to $w^\star$. This result is consistent with [14, 8, 15, 4].

Second, it is observed from (4) and (5) that exact diffusion has generally better steady-state mean-square-error performance than diffusion when . The superiority of exact diffusion is more evident when the bias term is significant, which can happen when the bias is large, or the network is sparse or badly-connected (in which case is close to ). Under these scenarios, if the step-size is moderately (but not extremely) small such that

(6) |

where and are constants given in Sec. IV-A, then exact diffusion will perform better than diffusion in steady-state.

Third, the superiority of exact diffusion over diffusion will vanish as the step-size $\mu$ approaches $0$. This is because the $O(\mu)$ term will finally dominate all other terms when $\mu$ is sufficiently small, i.e.,

(7) | ||||

(8) |

The “sufficiently” small can be roughly characterized as , where is any positive constant.

### I-B Related work

In addition to exact diffusion, there exist other bias-correction methods such as EXTRA [1, 16], ESOM [17], DIGing [18, 2, 19], Aug-DGM [20] and NIDS [21]. All these methods can converge linearly to the exact solution under the deterministic setting, but their performance (especially their advantage over diffusion or consensus) in the stochastic and adaptive settings remains unclear. The recent work [22] extends DIGing to the stochastic and adaptive scenarios and shows its superiority over consensus via numerical simulations. However, it does not analytically discuss when and why bias-correction methods can outperform consensus. Another useful work is [23], which establishes the convergence property of exact diffusion for stochastic non-convex cost functions and decaying step-sizes. It proves that exact diffusion is less sensitive to the data variance across the network than diffusion and is hence endowed with a better convergence rate when such data variance is large. Different from [23], our bound in (5) clearly shows that even small data variance (i.e., ) can be significantly amplified by a bad network connectivity; see the example graph topologies discussed in Sec. IV-A. This implies that the superiority of exact diffusion does not just rely on the robustness to data variance, but, more importantly, on the robustness to the network connectivity as well. In addition, in contrast to [22, 23], which show that exact diffusion always converges better than diffusion, this paper also clarifies scenarios where exact diffusion and diffusion have the same performance in steady state.

Notation. Throughout the paper we use $\mathrm{col}\{x_1, \ldots, x_N\}$ and $\mathrm{diag}\{x_1, \ldots, x_N\}$ to denote a column vector and a diagonal matrix formed from $x_1, \ldots, x_N$. The notation $\mathbb{1}_N = \mathrm{col}\{1, \ldots, 1\} \in \mathbb{R}^N$, and $I_N \in \mathbb{R}^{N \times N}$ is an identity matrix. The Kronecker product is denoted by “$\otimes$”. The notation $\rho(A)$ denotes the spectral radius of matrix $A$.

## II Exact Diffusion Strategy

### II-A Exact Diffusion Recursions

The exact diffusion strategy from [3, 4] was originally proposed to solve deterministic optimization problems. We adapt it to solve stochastic optimization problems by replacing the gradient of the local cost by the gradient of the corresponding loss function. That is, we now use:

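$$
\begin{aligned}
\psi_{k,i} &= w_{k,i-1} - \mu \nabla Q(w_{k,i-1}; x_{k,i}), \qquad &\text{(adaptation)} \qquad &(9)\\
\phi_{k,i} &= \psi_{k,i} + w_{k,i-1} - \psi_{k,i-1}, \qquad &\text{(correction)} \qquad &(10)\\
w_{k,i} &= \sum_{\ell \in \mathcal{N}_k} \bar{a}_{\ell k}\, \phi_{\ell,i}. \qquad &\text{(combination)} \qquad &(11)
\end{aligned}
$$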

Observe that the fusion step (11) now employs the corrected iterates from (10) rather than the intermediate iterates from (9). The recursions (9)–(11) can start from any $w_{k,-1}$, but we need to set $\psi_{k,-1} = w_{k,-1}$ for all $k$ in the initialization. Note that the weight $\bar{a}_{\ell k}$ is different from the weight $a_{\ell k}$ used in the diffusion recursion (3). If we let $A = [a_{\ell k}]$ and $\bar{A} = [\bar{a}_{\ell k}]$ denote the combination matrices used in diffusion and exact diffusion, respectively, the relation between them is $\bar{A} = (A + I_N)/2$. In this paper, we assume $A$ (and hence $\bar{A}$) to be symmetric and doubly stochastic.
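The adapt, correct and combine steps (9)–(11), together with the initialization $\psi_{k,-1} = w_{k,-1}$ and the relation $\bar{A} = (A + I_N)/2$, can be sketched as follows. This is a minimal illustration with exact (noise-free) gradients, not the authors' implementation. The quadratic local risks, the ring weights, the step-size, and the helper names `ring_weights`, `toy_problem` and `exact_diffusion` are all assumptions introduced for the example.

```python
import numpy as np

def ring_weights(n):
    """Symmetric, doubly stochastic weights for a ring of n >= 3 agents
    (assumed choice: weight 1/3 on the agent itself and on each neighbor)."""
    A = np.zeros((n, n))
    for k in range(n):
        for l in (k - 1, k, k + 1):
            A[k, l % n] = 1.0 / 3.0
    return A

def toy_problem(n=10, dim=2, seed=0):
    """Hypothetical local risks J_k(w) = 0.5 * h_k * ||w - c_k||^2."""
    rng = np.random.default_rng(seed)
    h = rng.uniform(1.0, 2.0, size=n)
    C = rng.standard_normal((n, dim))
    w_star = (h[:, None] * C).sum(axis=0) / h.sum()  # minimizer of sum_k J_k
    grad = lambda W: h[:, None] * (W - C)            # row k holds grad J_k(w_k)
    return grad, w_star

def exact_diffusion(A, grad, W0, mu, iters):
    """Exact diffusion; row k of every array belongs to agent k."""
    A_bar = 0.5 * (A + np.eye(A.shape[0]))  # A_bar = (A + I) / 2
    W = W0.copy()
    Psi_prev = W0.copy()                    # initialization psi_{k,-1} = w_{k,-1}
    for _ in range(iters):
        Psi = W - mu * grad(W)              # adaptation, cf. (9)
        Phi = Psi + W - Psi_prev            # correction, cf. (10)
        W = A_bar.T @ Phi                   # combination, cf. (11)
        Psi_prev = Psi
    return W

n, dim = 10, 2
grad, w_star = toy_problem(n, dim)
W_final = exact_diffusion(ring_weights(n), grad, np.zeros((n, dim)), mu=0.1, iters=5000)
msd_exact = np.mean(np.sum((W_final - w_star) ** 2, axis=1))
print("exact diffusion steady-state MSD:", msd_exact)
```

With exact gradients, the iterates in this sketch are expected to reach $w^\star$ up to numerical precision, in contrast with the residual bias of the uncorrected recursions (2)–(3).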

As explained in [3, 4], exact diffusion is essentially a primal-dual method. We can describe its operation more succinctly by collecting the iterates and gradients from across the network into global vectors. Specifically, we introduce

(12) |

and , then recursions (9) – (11) lead to the second-order recursion

(13) |
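The step from the three-stage recursion to a second-order recursion can be checked agent by agent. The short derivation below is a sketch that assumes (9)–(11) take the standard adapt, correct and combine form of exact diffusion from [3, 4]; the per-agent notation $\psi_{k,i}$, $\phi_{k,i}$, $\bar{a}_{\ell k}$ is written out explicitly so that the block stands on its own, and it holds for $i \ge 1$.

```latex
% Assumed form of (9)-(11):
%   \psi_{k,i} = w_{k,i-1} - \mu \nabla Q(w_{k,i-1}; x_{k,i})
%   \phi_{k,i} = \psi_{k,i} + w_{k,i-1} - \psi_{k,i-1}
%   w_{k,i}    = \sum_{\ell \in \mathcal{N}_k} \bar{a}_{\ell k} \phi_{\ell,i}
\begin{align*}
\phi_{\ell,i}
  &= \psi_{\ell,i} + w_{\ell,i-1} - \psi_{\ell,i-1} \\
  &= \big( w_{\ell,i-1} - \mu \nabla Q(w_{\ell,i-1}; x_{\ell,i}) \big)
     + w_{\ell,i-1}
     - \big( w_{\ell,i-2} - \mu \nabla Q(w_{\ell,i-2}; x_{\ell,i-1}) \big) \\
  &= 2 w_{\ell,i-1} - w_{\ell,i-2}
     - \mu \big( \nabla Q(w_{\ell,i-1}; x_{\ell,i})
               - \nabla Q(w_{\ell,i-2}; x_{\ell,i-1}) \big), \\[4pt]
w_{k,i}
  &= \sum_{\ell \in \mathcal{N}_k} \bar{a}_{\ell k}
     \Big[ 2 w_{\ell,i-1} - w_{\ell,i-2}
     - \mu \big( \nabla Q(w_{\ell,i-1}; x_{\ell,i})
               - \nabla Q(w_{\ell,i-2}; x_{\ell,i-1}) \big) \Big].
\end{align*}
```

Stacking the last line over all agents gives a recursion that involves only the two most recent network iterates and the difference of two consecutive stochastic gradients, which is the sense in which the update is of second order.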

We can rewrite this update in a primal-dual form as follows. First, since the combination matrix is symmetric and doubly stochastic, it holds that is positive semi-definite. By decomposing and defining , where is a non-negative diagonal matrix, we know that is also positive semi-definite and . Furthermore, if we let then holds. With these relations, it can be verified (by substituting the second recursion in (14) into the first recursion to remove the dual variable and arriving at (13)) that recursion (13) is equivalent to

(14) |

where plays the role of a dual variable. The analysis in [3, 4] explains how the correction term in (10) guarantees exact convergence to by all agents in deterministic optimization problems where the true gradient is available. In the following sections, we will examine the convergence of exact diffusion (9)–(11) in the stochastic setting.

## III Error Dynamics of Exact Diffusion

To establish the error dynamics of exact diffusion, we first introduce several standard assumptions. These assumptions are quite common in the literature (e.g., [7, 8]).

###### Assumption 1 (Conditions on cost functions)

Each is -strongly convex and twice differentiable, and its Hessian matrix satisfies

(15) |

###### Assumption 2 (Conditions on combination matrix)

The network is undirected and strongly connected, and the combination matrix satisfies

(16) |

Assumption 2 implies that is also symmetric and doubly-stochastic. Since the network is strongly connected, it holds that

(17) |

To establish the optimality condition for problem (1), we introduce the following notation:

(18) | ||||

(19) |

where in (18) is the -th block entry of vector . With the above notation, the following lemma from [4] states the optimality condition for problem (1).

###### Lemma 1 (Optimality Condition)

### III-A Error Dynamics

We define the gradient noise at agent as

(23) |

and collect them into the network vector

(24) | ||||

(25) |

It then follows that

(26) |

Next, we introduce the error vectors

(27) |

where are optimal solutions satisfying (20)–(21). By combining (14), (20), (21), (26) and (27), we reach

(28) |

Since each is twice-differentiable (see Assumption 1), we can appeal to the mean-value theorem from Lemma D.1 in [8], which allows us to express each difference in (28) in terms of Hessian matrices for any :

where

(29) |

We introduce the block diagonal matrix

(30) |

so that

(31) |

Substituting (31) into the first recursion in (28), we reach

(32) |

Next, if we substitute the first recursion in (32) into the second one, and recall that , we reach the following error dynamics.

### III-B Transformed Error Dynamics

The direct convergence analysis of the error recursion in Lemma 2 is still challenging. To facilitate the analysis, we identify a convenient change of basis and transform that recursion into another equivalent form that is easier to handle. To this end, we introduce a fundamental decomposition from [4].

###### Lemma 3 (Fundamental Decomposition)

Under Assumptions 1 and 2, the matrix defined in (2) can be decomposed as

(34) |

where can be any positive constant, and is a diagonal matrix. Moreover, we have

(35) | ||||

(36) | ||||

(37) |

where . Also, the matrix is a diagonal matrix with complex entries. The magnitudes of the diagonal entries are all strictly less than .

By multiplying to both sides of the error dynamics (2) and simplifying we arrive at the following result.

## IV Mean-square Convergence

Using the transformed error dynamics derived in Lemma 4, we can now analyze the mean-square convergence of exact diffusion (9)–(11) in the stochastic and adaptive setting. To begin with, we introduce the filtration

(40) |

The following assumption on the gradient noise process is standard (see [8, 22]) and is satisfied by linear regression and logistic regression problems.

###### Assumption 3 (Conditions on gradient noise)

It is assumed that the first and second-order conditional moments of the individual gradient noises for any $k$ and $i$ satisfy

(41) | ||||

(42) |

for some constants and . Moreover, we assume each is independent for any given .

###### Theorem 1 (Mean-Square Convergence)

Proof. See Appendix A.

Theorem 1 indicates that when is smaller than a specified upper bound, the exact diffusion over adaptive networks is stable. The theorem also provides a bound on the size of the steady-state mean-square error. To compare exact diffusion with diffusion, we examine the mean-square convergence property of diffusion as well.

###### Lemma 5 (Mean-square stability of Diffusion)

Under Assumptions 1–3, if satisfies

(47) |

where , and are constants that are independent of , , and , then the diffusion recursions (2)–(3) converge exponentially fast to a neighborhood around . The convergence rate is , and the size of the neighborhood can be characterized as follows:

(48) |

where is a bias term.

Proof. The arguments are similar to the proof of (46). See Appendix B.

Comparing (46) and (5), it is observed that the expressions for both algorithms consist of two major terms: one $O(\mu)$ term and one $O(\mu^2)$ term. However, diffusion suffers from an additional bias term . In the following, we compare diffusion and exact diffusion in two scenarios.

### IV-A Bias term is significant

When is large, or the network is sparse, it is possible that the bias term is significant. We assume such bias term in (5) is significant if

(49) |

from which we get Combining with (45), we conclude that if step-size satisfies

(50) |

where and are some constants, then the bias term in (5) is significant and exact diffusion is expected to have better performance than diffusion in steady-state. To make the interval in (50) valid, it is enough to let

(51) |

In the following example, we list several network topologies in which the inherent bias dominates (5) easily.

Example 1 (Linear or Cyclic network). Consider a linear or cyclic network with agents where each node connects with its previous and next neighbors. It is shown in [24] that

(52) |

Therefore, the bias term in diffusion becomes , which increases rapidly with the size of the network.
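How quickly a cyclic network loses connectivity as it grows can be checked numerically. The sketch below builds one particular symmetric, doubly stochastic combination matrix for a ring and computes the second largest eigenvalue magnitude $\lambda$. The uniform weights of $1/3$ and the helper names `ring_weights` and `second_largest_eig_magnitude` are assumptions made for this example; the exact expression in (52) is not reproduced here. For this specific matrix the eigenvalues are $\frac{1}{3} + \frac{2}{3}\cos(2\pi j/N)$, so $1 - \lambda \approx 4\pi^2/(3N^2)$ for large $N$.

```python
import numpy as np

def ring_weights(n):
    """Symmetric, doubly stochastic weights for a ring of n >= 3 agents
    (assumed choice: weight 1/3 on the agent itself and on each neighbor)."""
    A = np.zeros((n, n))
    for k in range(n):
        for l in (k - 1, k, k + 1):
            A[k, l % n] = 1.0 / 3.0
    return A

def second_largest_eig_magnitude(A):
    """Second largest magnitude among the eigenvalues of a symmetric matrix."""
    mags = np.sort(np.abs(np.linalg.eigvalsh(A)))
    return mags[-2]

# Spectral gap 1 - lambda for rings of increasing size.
gaps = {n: 1.0 - second_largest_eig_magnitude(ring_weights(n))
        for n in (10, 20, 40, 80)}
for n, gap in gaps.items():
    print(f"N = {n:3d}   1 - lambda = {gap:.6f}   N^2 * (1 - lambda) = {gap * n**2:.3f}")
```

Doubling the ring size shrinks the gap $1 - \lambda$ by roughly a factor of four in this sketch, so any quantity that scales with an inverse power of $1 - \lambda$ grows quickly with the network size.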

Example 2 (Grid network). Consider a grid network with agents where each node connects with its neighbors from left, right, behind and front. It is shown in [25] that

(53) |

for grid networks. Therefore, the bias term in diffusion is which also increases with .

### IV-B Bias term is trivial

In theory, if we adjust to be sufficiently small, the term in both expressions (46) and (5) will eventually dominate for any and . In such scenario, it holds that

(54) | ||||

(55) |

It is observed that both diffusion and exact diffusion will have the same mean-square error order, which implies that diffusion and exact diffusion will perform similarly in this scenario. Such “sufficiently” small step-size can be roughly characterized by the range

(56) |

For example, we can substitute it into (46) to verify

(57) |

and we can also verify with the same technique.

## V Numerical simulation

In this section we compare the performance of exact diffusion and diffusion when solving the decentralized logistic regression problem:

(58) |

where represent the streaming data received by agent . Variable is the feature vector and is the label scalar. In all experiments, we set and . To make the ’s have different minimizers, we first generate different local minimizers . All are normalized so that . At agent , we generate each feature vector . To generate the corresponding label , we generate a random variable . If , we set ; otherwise . The MSD in y-axis indicates mean-square deviation .
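The data-generation procedure described above can be sketched as follows. The specific choices here are assumptions made so that the example runs, since the corresponding values are not recoverable from the text: standard Gaussian features, a uniform random variable compared against the logistic function of $h^\top w_k^o$, unit-norm local minimizers, a loss of the form $\ln(1 + e^{-\gamma h^\top w}) + \frac{\rho}{2}\|w\|^2$, and the dimension, regularization value and helper names (`make_local_minimizers`, `sample`, `loss`, `loss_grad`).

```python
import numpy as np

def make_local_minimizers(n_agents, dim, rng):
    """Draw one local model per agent and normalize it (assumed: unit norm)."""
    W = rng.standard_normal((n_agents, dim))
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def sample(w_local, rng):
    """One streaming data pair (h, gamma) for an agent with local model w_local."""
    h = rng.standard_normal(w_local.shape[0])     # assumed: h ~ N(0, I)
    z = rng.uniform()                             # assumed: z ~ U(0, 1)
    gamma = 1.0 if z <= 1.0 / (1.0 + np.exp(-h @ w_local)) else -1.0
    return h, gamma

def loss(w, h, gamma, rho):
    """Assumed regularized logistic loss Q(w; h, gamma)."""
    return np.log1p(np.exp(-gamma * (h @ w))) + 0.5 * rho * (w @ w)

def loss_grad(w, h, gamma, rho):
    """Gradient of the assumed loss with respect to w."""
    return -gamma * h / (1.0 + np.exp(gamma * (h @ w))) + rho * w

rng = np.random.default_rng(0)
W_local = make_local_minimizers(n_agents=4, dim=5, rng=rng)
samples = [sample(W_local[0], rng) for _ in range(200)]
labels = np.array([gamma for _, gamma in samples])
print("fraction of +1 labels at agent 0:", np.mean(labels == 1.0))
```

A stochastic-gradient step at an agent would then call `loss_grad` on the freshly drawn pair `(h, gamma)` at every iteration, which is the role played by the loss gradient in the adaptation steps (2) and (9).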

In the following, we run two sets of simulations. In the first set, we test the performance of diffusion and exact diffusion over cyclic networks of different sizes . For each simulation, we fix and compare diffusion and exact diffusion for and . When the size of the cyclic network becomes larger, we know from the examples in Sec. IV-A that the inherent bias will increase drastically. In this scenario, we can expect exact diffusion to have better performance in steady-state. The left and middle plots in Figure 1 confirm this conclusion. It is observed that when the network is small, diffusion and exact diffusion perform almost the same since the inherent bias is trivial. However, as the size increases, the term becomes dominant and exact diffusion is significantly better than diffusion.

In the second set of simulations, we fix the cyclic network size and compare diffusion and exact diffusion at and . As we discussed in Sec. IV-B, the term will gradually dominate all other higher-order terms as . As a result, we can expect diffusion and exact diffusion to match each other as becomes sufficiently small. The middle and right plots in Figure 1 confirm this conclusion. When , both methods perform almost the same.

## A. Proof of Theorem 1

From the first line in the transformed error dynamics of Lemma 4, we know that

(59) |

By squaring and taking conditional expectation of both sides of the recursion and recalling (41), we get

(60) |

Next note that

(61) |

where (a) holds for because of Jensen’s inequality, and (b) holds since , , and when . Moreover, equality (c) holds if we choose . In addition, recall (44) that

(62) |

Moreover, we can bound as

(63) |

Substituting (61), (62) and (63) into (60), we reach

(64) |

where the last inequality holds since

(65) |

By taking expectation over the filtration, we get
