I Introduction and Review of Part I
We briefly review the notation and findings from Part I in preparation for examining the challenges that arise in the multi-agent scenario. In Part I, we considered an optimization problem of the form:
\[
\min_{w}\; J(w),
\]
where the possibly non-differentiable but strongly convex risk function $J(w)$ was expressed as the expectation of a convex but also possibly non-differentiable loss function $Q(w;\boldsymbol{x})$, namely,
\[
J(w) \;=\; \mathbb{E}\, Q(w;\boldsymbol{x}).
\]
Here, the letter $\boldsymbol{x}$ represents the random data and the expectation operation is over the distribution of this data. The following sub-gradient algorithm was introduced and studied in Part I for seeking the minimizer $w^{\star}$:
with given initial conditions. Boldface notation is used for the iterates $\boldsymbol{w}_i$ to highlight their stochastic nature, since the successive iterates are generated by relying on streaming data realizations $\boldsymbol{x}_i$. Moreover, the scalar $\mu$ is a small constant step-size, chosen so that the factor $\lambda = 1 - O(\mu)$, which controls the convergence rate, is a number close to one. The term $\widehat{g}(\boldsymbol{w}_{i-1})$ is an approximate sub-gradient at location $\boldsymbol{w}_{i-1}$; it is computed from the data available at time $i$ and approximates a true sub-gradient denoted by $g(\boldsymbol{w}_{i-1})$. This true sub-gradient is unavailable since $J(w)$ itself is unavailable in the stochastic context. This is because the distribution of the data
is unknown beforehand, which means that the expected loss function $J(w)$ cannot be evaluated. The difference between the true sub-gradient vector and its approximation is called gradient noise and is denoted by
\[
\boldsymbol{s}_i(\boldsymbol{w}_{i-1}) \;\triangleq\; \widehat{g}(\boldsymbol{w}_{i-1}) \;-\; g(\boldsymbol{w}_{i-1}).
\]
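As a concrete illustration of the single-agent recursion reviewed above, the following minimal Python sketch runs a stochastic sub-gradient iteration with a constant step-size on a toy $\ell_1$-regularized least-squares risk. The data model, dimensions, step-size, and regularization weight are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy strongly convex, non-differentiable risk (hypothetical setup):
# J(w) = E (d - x^T w)^2 + delta * ||w||_1, approximated from streaming data.
M = 5
w_true = rng.normal(size=M)
mu = 0.01         # small constant step-size
delta = 0.01      # l1-regularization weight

w = np.zeros(M)
for i in range(5000):
    x = rng.normal(size=M)                 # streaming regressor
    d = x @ w_true + 0.1 * rng.normal()    # noisy measurement
    # Approximate (instantaneous) sub-gradient of the loss at the current iterate:
    g_hat = -2 * (d - x @ w) * x + delta * np.sign(w)
    w = w - mu * g_hat                     # stochastic sub-gradient step

# With a constant step-size, w hovers in a small neighborhood of the minimizer.
print(np.linalg.norm(w - w_true))
```

With diminishing step-sizes the iterate would converge exactly but adaptation would die out; the constant step-size keeps learning alive at the cost of a small residual error, as discussed in the interpretation below.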
I-A Data Model and Assumptions
The following three assumptions were motivated in Part I:
The risk function $J(w)$ is strongly convex so that the minimizer $w^{\star}$ is unique. The strong convexity of $J(w)$ means that
\[
J\big(\theta w_1 + (1-\theta)w_2\big) \;\leq\; \theta J(w_1) + (1-\theta)J(w_2) \;-\; \frac{\eta}{2}\,\theta(1-\theta)\,\|w_1 - w_2\|^2
\]
for any $w_1, w_2$, any $\theta \in [0,1]$, and some $\eta > 0$. The above condition is equivalent to requiring, for any sub-gradient $g(w_2) \in \partial J(w_2)$:
\[
J(w_1) \;\geq\; J(w_2) \;+\; g(w_2)^{\sf T}(w_1 - w_2) \;+\; \frac{\eta}{2}\,\|w_1 - w_2\|^2.
\]
The sub-gradient is affine Lipschitz, meaning that there exist constants $c \geq 0$ and $d \geq 0$ such that
\[
\|g(w_1) - g(w_2)\| \;\leq\; c\,\|w_1 - w_2\| \;+\; d
\]
for any $g(w_1) \in \partial J(w_1)$, $g(w_2) \in \partial J(w_2)$, and for any $w_1, w_2$. Here, the notation $\partial J(w)$ denotes the sub-differential at location $w$ (i.e., the set of all possible sub-gradient vectors at $w$). It was explained in Part I how this affine Lipschitz condition is weaker than conditions used before in the literature and how important cases of interest (such as SVM, LASSO, Total Variation) satisfy it automatically (but do not satisfy the previous conditions). For later use, it is easy to verify (as was done in (50) in Part I) that condition (9) implies that
\[
\|g(w_1) - g(w_2)\|^2 \;\leq\; e^2\,\|w_1 - w_2\|^2 \;+\; f^2
\]
for any $w_1, w_2$ and some constants $e^2 \geq 0$ and $f^2 \geq 0$.
The first- and second-order moments of the gradient noise process satisfy the conditions:
\[
\mathbb{E}\,\big[\,\boldsymbol{s}_i(\boldsymbol{w}_{i-1})\,|\,\boldsymbol{\mathcal{F}}_{i-1}\,\big] \;=\; 0, \qquad
\mathbb{E}\,\big[\,\|\boldsymbol{s}_i(\boldsymbol{w}_{i-1})\|^2\,|\,\boldsymbol{\mathcal{F}}_{i-1}\,\big] \;\leq\; \beta^2\,\|\widetilde{\boldsymbol{w}}_{i-1}\|^2 \;+\; \sigma^2,
\]
where $\widetilde{\boldsymbol{w}}_{i-1} \triangleq w^{\star} - \boldsymbol{w}_{i-1}$, for some constants $\beta^2 \geq 0$ and $\sigma^2 \geq 0$, and where the notation $\boldsymbol{\mathcal{F}}_{i-1}$ denotes the filtration (collection) corresponding to all past iterates:
\[
\boldsymbol{\mathcal{F}}_{i-1} \;\triangleq\; \text{filtration}\{\boldsymbol{w}_{-1}, \boldsymbol{w}_{0}, \ldots, \boldsymbol{w}_{i-1}\}.
\]
Under the three conditions 1)–3), which are automatically satisfied for important cases of interest, the following important conclusion was proven in Part I for the stochastic sub-gradient algorithm (3)–(5) above. At every iteration $i$, it will hold that
\[
\mathbb{E}\,J(\boldsymbol{w}^{\rm best}_i) \;-\; J(w^{\star}) \;\leq\; O(\mu) \;+\; O(\lambda^i),
\]
where the convergence of $\mathbb{E}\,J(\boldsymbol{w}^{\rm best}_i)$ to the $O(\mu)$-neighborhood of $J(w^{\star})$ occurs at an exponential rate determined by $\lambda = 1 - O(\mu)$.
I-B Interpretation of Result
For the benefit of the reader, we repeat here the interpretation that was given in Sec. IV.D of Part I for the key result (14); these remarks will be relevant in the networked case and are therefore useful to highlight again:
First, it has been observed in the optimization literature [4, 3, 5] that sub-gradient descent iterations can perform poorly in deterministic problems (where $J(w)$ is known). Their convergence rate is $O(1/\sqrt{i})$ under convexity and $O(1/i)$ under strong convexity when decaying step-sizes, $\mu(i) = O(1/i)$, are used to ensure convergence. Result (14) shows that the situation is different in the context of stochastic optimization, when true sub-gradients are approximated from streaming data, due to different requirements. By using constant step-sizes to enable continuous learning and adaptation, the sub-gradient iteration is now able to achieve exponential convergence at the rate $O(\lambda^i)$, with $\lambda = 1 - O(\mu)$, to steady-state.
Second, this substantial improvement in convergence rate comes at a cost, but one that is acceptable and controllable. Specifically, we cannot guarantee convergence of the algorithm to the global minimum value, $J(w^{\star})$, anymore, but we can instead approach this optimal value with high accuracy on the order of $O(\mu)$, where the size of $\mu$ is under the designer's control and can be selected as small as desired.
Third, this performance level is sufficient in most cases of interest because, in practice, one rarely has an infinite amount of data and, moreover, the data is often subject to distortions not captured by any assumed models. It is increasingly recognized in the literature that it is not always necessary to ensure exact convergence towards the optimal solution, $w^{\star}$, or the minimum value, $J(w^{\star})$, because these optimal values may not reflect accurately the true state due to modeling errors. For example, it is explained in the works [6, 7, 3, 8] that it is generally unnecessary to reduce the error measures below the statistical error level that is present in the data.
I-C This Work
The purpose of this work is to examine how these properties reveal themselves in the networked case when a multitude of interconnected agents cooperate to minimize an aggregate cost function that is not generally smooth. In this case, it is necessary to examine closely the effect of the coupled dynamics and whether agents will still be able to agree fast enough under non-differentiability.
Notation: We use lowercase letters to denote vectors, uppercase letters for matrices, plain letters for deterministic variables, and boldface letters for random variables. We also use $(\cdot)^{\sf T}$ to denote transposition, $(\cdot)^{-1}$ for matrix inversion, ${\rm Tr}(\cdot)$ for the trace of a matrix, $\lambda(\cdot)$ for the eigenvalues of a matrix, $\|\cdot\|$ for the 2-norm of a matrix or the Euclidean norm of a vector, and $\rho(\cdot)$ for the spectral radius of a matrix. Besides, we write $A \geq 0$ to denote that $A$ is positive semi-definite, and $x \succ 0$ to denote that all entries of vector $x$ are positive.
II Problem Formulation: Multi-Agent Case
We now extend the single-agent analysis to multi-agent networks, where a collection of $N$ agents cooperate with each other to seek the minimizer of a weighted aggregate cost of the form:
\[
\min_{w}\; J^{\rm glob}(w) \;\triangleq\; \sum_{k=1}^{N} q_k\, J_k(w),
\]
where $k$ refers to the agent index and $q_k$ is some positive weighting coefficient added for generality. When the $\{q_k\}$ are uniform and equal to each other, then (15) amounts to minimizing the aggregate sum of the individual risks. We can assume, without loss of generality, that the weights are normalized to add up to one:
\[
\sum_{k=1}^{N} q_k \;=\; 1.
\]
Each individual risk function continues to be expressed as the expected value of some loss function:
\[
J_k(w) \;=\; \mathbb{E}\, Q_k(w; \boldsymbol{x}_k).
\]
Here, the letter $\boldsymbol{x}_k$ represents the random data at agent $k$, and the expectation is over the distribution of this data. Many problems in adaptation and learning involve risk functions of this form, including, for example, mean-square-error designs and support vector machine (SVM) solutions (see, e.g., [18, 19, 20]). We again allow each risk function $J_k(w)$ to be non-differentiable. This situation is common in machine learning formulations, e.g., in SVM costs and in regularized sparsity-inducing formulations.
We continue to assume that the individual costs satisfy Assumptions 1 and 2 described in the introduction, namely, conditions (8), (9), and (10), which ensure that each $J_k(w)$ is strongly convex and that its sub-gradient vectors are affine Lipschitz with parameters $\{\eta_k, c_k, d_k\}$; we attach a subscript $k$ to these parameters to make them agent-dependent (alternatively, if desired, we can replace them by agent-independent parameters by using bounds on their values).
II-A Network Model
We consider a network consisting of $N$ separate agents connected by a topology. As described in [21, 12], we assign a pair of nonnegative weights, $\{a_{k\ell}, a_{\ell k}\}$, to the edge connecting any two agents $k$ and $\ell$. The scalar $a_{\ell k}$ is used by agent $k$ to scale the data it receives from agent $\ell$, and similarly for $a_{k\ell}$. The network is said to be connected if paths with nonzero scaling weights can be found linking any two distinct agents in both directions. The network is said to be strongly connected if it is connected with at least one self-loop, meaning that $a_{kk} > 0$ for some agent $k$. Figure 1 shows one example of a strongly connected network. For emphasis in this figure, each edge between two neighboring agents is represented by two directed arrows. The neighborhood of any agent $k$ is denoted by $\mathcal{N}_k$ and it consists of all agents that are connected to $k$ by edges; we assume by default that this set includes agent $k$ regardless of whether agent $k$ has a self-loop or not.
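The strong-connectivity property can be checked numerically through primitivity of the combination matrix. The sketch below uses a hypothetical 3-agent left-stochastic matrix (an illustrative assumption); by Wielandt's classical bound, an $n \times n$ nonnegative matrix is primitive exactly when its power $n^2 - 2n + 2$ is entrywise positive.

```python
import numpy as np

# Hypothetical combination matrix for a 3-agent network: entries are
# nonnegative, each column sums to one (left-stochastic), and agent 0
# has a self-loop, matching the strong-connectivity discussion.
A = np.array([
    [0.5, 0.3, 0.0],
    [0.5, 0.2, 0.6],
    [0.0, 0.5, 0.4],
])

def is_primitive(A):
    """A nonnegative matrix is primitive iff some power is entrywise positive;
    for an n x n matrix, checking the power n**2 - 2n + 2 suffices (Wielandt)."""
    n = A.shape[0]
    k = n * n - 2 * n + 2
    P = np.linalg.matrix_power((A > 0).astype(float), k)
    return bool(np.all(P > 0))

print(A.sum(axis=0))     # each column adds up to one
print(is_primitive(A))   # strongly connected topology => primitive matrix
```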
There are several strategies that the agents can employ to seek the minimizer $w^{\star}$, including consensus and diffusion strategies [11, 9, 10, 21, 12]. As noted earlier, in this work we focus on the latter class, since diffusion implementations have been shown to have superior stability and performance properties over consensus strategies when used in the context of adaptation and learning from streaming data (i.e., when the step-sizes are set to a constant value as opposed to a diminishing value) [21, 12, 15]. Although diminishing step-sizes annihilate the gradient noise term, they nevertheless disable adaptation and learning in the long run. In comparison, constant step-size updates keep adaptation alive, but they allow gradient noise to seep into the operation of the algorithm. The challenge in these scenarios is therefore to show that the dynamics of the diffusion strategy over the network is such that the gradient noise effect does not degrade performance and that the network is still able to learn the unknown. This question has been answered before in the affirmative for smooth twice-differentiable functions; see [21, 12, 13, 14]. In this work, we pursue the analysis more generally for possibly non-differentiable risks in order to encompass important applications (such as SVM learning by multi-agents, or LASSO and sparsity-aware learning by similar agents [22, 23, 24, 25]). We also pursue the analysis under the affine-Lipschitz assumption (9) on the sub-gradients, which is weaker than the conditions used in the prior literature, as we already explained in the earlier sections and in Part I.
II-B Diffusion Strategy
We consider the following diffusion strategy in its adapt-then-combine (ATC) form:
\[
\left\{
\begin{aligned}
\boldsymbol{\psi}_{k,i} \;&=\; \boldsymbol{w}_{k,i-1} \;-\; \mu_k\,\widehat{g_k}(\boldsymbol{w}_{k,i-1}), \\
\boldsymbol{w}_{k,i} \;&=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\,\boldsymbol{\psi}_{\ell,i}.
\end{aligned}
\right.
\]
Here, the first step involves adaptation by agent $k$ by using a stochastic sub-gradient iteration, while the second step involves aggregation; we assume the gradient noise processes across all agents are independent of each other. The entries $\{a_{\ell k}\}$
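A minimal simulation of the adapt-then-combine mechanics can be sketched as follows. The network size, combination matrix, data model, and step-size are illustrative assumptions, and a least-mean-squares-type quadratic loss stands in for a generic (possibly non-smooth) loss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setting: N agents estimate a common model w_true from streaming
# data d = x^T w_true + noise, using ATC diffusion with sub-gradient-type steps.
N, M = 3, 4
w_true = rng.normal(size=M)
mu = 0.02

# Left-stochastic combination matrix (columns sum to one).
A = np.array([
    [0.5, 0.3, 0.0],
    [0.5, 0.2, 0.6],
    [0.0, 0.5, 0.4],
])

W = np.zeros((N, M))                 # current iterates, one row per agent
for i in range(4000):
    Psi = np.zeros_like(W)
    for k in range(N):               # adaptation: stochastic (sub)gradient step
        x = rng.normal(size=M)
        d = x @ w_true + 0.1 * rng.normal()
        g_hat = -2 * (d - x @ W[k]) * x
        Psi[k] = W[k] - mu * g_hat
    W = A.T @ Psi                    # combination: w_k = sum_l a_{lk} psi_l

print(max(np.linalg.norm(W[k] - w_true) for k in range(N)))
```

Note that the combination step is a single matrix product because row $k$ of $A^{\sf T}\Psi$ is exactly the neighborhood average $\sum_{\ell} a_{\ell k}\psi_{\ell}$.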
define a left-stochastic matrix $A = [a_{\ell k}]$, namely, the entries of $A$ are non-negative and each of its columns adds up to one:
\[
a_{\ell k} \;\geq\; 0, \qquad \sum_{\ell=1}^{N} a_{\ell k} \;=\; 1.
\]
Since the network is strongly connected, the combination matrix $A$ will be primitive [26, 21]. This implies that $A$ will admit a Jordan decomposition of the form:
\[
A \;=\; V_{\epsilon}\, J\, V_{\epsilon}^{-1}, \qquad J \;=\; \begin{bmatrix} 1 & 0 \\ 0 & J_{\epsilon} \end{bmatrix},
\]
with a single eigenvalue at one and all other eigenvalues strictly inside the unit circle. The matrix $J_{\epsilon}$ has a Jordan structure with the ones that would typically appear along its first sub-diagonal replaced by a small positive number, $\epsilon$. Note that the right- and left-eigenvectors of $A$ corresponding to the eigenvalue at one are $p$ and $\mathds{1}$, satisfying
\[
A\,p \;=\; p, \qquad \mathds{1}^{\sf T} A \;=\; \mathds{1}^{\sf T},
\]
where $\mathds{1}$ refers to a column vector with all its entries equal to one. It is further known from the Perron-Frobenius theorem that the entries of $p$ are all strictly positive; we normalize them to add up to one. We denote the individual entries of $p$ by $\{p_k\}$ so that:
\[
p_k \;>\; 0, \qquad \sum_{k=1}^{N} p_k \;=\; 1.
\]
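Because the combination matrix is primitive with its dominant eigenvalue at one, the Perron vector can be computed by plain power iteration, as the following sketch shows (the 3-agent left-stochastic matrix is a hypothetical example):

```python
import numpy as np

# Hypothetical left-stochastic, primitive combination matrix (columns sum to one).
A = np.array([
    [0.5, 0.3, 0.0],
    [0.5, 0.2, 0.6],
    [0.0, 0.5, 0.4],
])

p = np.ones(A.shape[0]) / A.shape[0]
for _ in range(200):          # power iteration converges since A is primitive
    p = A @ p
    p = p / p.sum()           # normalize the entries to add up to one

print(p)                      # strictly positive entries summing to one
print(np.allclose(A @ p, p))  # A p = p: eigenvalue at one
```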
Furthermore, since $A$ is primitive with Perron vector $p$, it holds that
\[
\lim_{i\to\infty} A^{i} \;=\; p\,\mathds{1}^{\sf T}.
\]
Next, we introduce the vector
\[
q \;\triangleq\; {\rm col}\{q_1, q_2, \ldots, q_N\},
\]
where $q_k$ is the weight associated with $J_k(w)$ in (15). Since the designer is free to select the step-size parameters $\{\mu_k\}$, it turns out that we can always relate the vectors $\{q, p\}$ in the following manner:
\[
q_k \;=\; \frac{\mu_k\, p_k}{\mu}, \qquad k = 1, 2, \ldots, N,
\]
for some constant $\mu > 0$. Note, for instance, that for (24) to be valid the scalar $\mu$ should satisfy $\mu = \mu_k p_k / q_k$ for all $k$. To make this expression for $\mu$ independent of $k$, we may parameterize (select) the step-sizes as
\[
\mu_k \;=\; \frac{\mu\, q_k}{p_k}, \qquad k = 1, 2, \ldots, N.
\]
Note that since the $\{p_k\}$ are positive, smaller than one, and their sum is one, the above expression shows that $\mu = \sum_{k=1}^{N} p_k\,\mu_k$ can be interpreted as a weighted average step-size parameter.
III Network Performance
We are now ready to extend Theorem 1 from Part I to the network case. The analysis is more challenging due to the coupling among the agents. But the result will establish that the distributed strategy is stable and converges exponentially fast for sufficiently small step-sizes. As was the case with Part I , the statement below is again in terms of pocket variables, which we define as follows.
At every iteration $i$, the risk value that is attained by iterate $\boldsymbol{w}_{k,i}$ is $J_k(\boldsymbol{w}_{k,i})$. This value is a random variable due to the randomness in the streaming data used to run the algorithm. We denote the mean risk value at agent $k$ by $\mathbb{E}\,J_k(\boldsymbol{w}_{k,i})$. We again introduce a best pocket iterate, denoted by $\boldsymbol{w}^{\rm best}_{k,i}$. At any iteration $i$, the value that is saved in this pocket variable is the iterate that has generated the smallest mean risk value up to time $i$, i.e.,
\[
\boldsymbol{w}^{\rm best}_{k,i} \;\triangleq\; \underset{\boldsymbol{w}_{k,j},\; 0 \leq j \leq i}{\arg\min}\;\; \mathbb{E}\, J_k(\boldsymbol{w}_{k,j}).
\]
Observe that in the network case we now have $N$ pocket values, one for each agent.
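The pocket construction itself is simple to illustrate: scan the iterates and retain the one with the smallest risk value seen so far. In the sketch below a toy computable risk stands in for the mean risk, which in the stochastic setting cannot be evaluated (the pocket is an analysis device, not an implementable quantity).

```python
import numpy as np

def J(w):
    # Toy strongly convex risk (illustrative stand-in), minimized at w = 1.
    return float(np.sum((w - 1.0) ** 2))

# Hypothetical sequence of iterates produced by some algorithm.
iterates = [np.array([3.0]), np.array([0.5]), np.array([1.2]), np.array([2.0])]

w_best, J_best = None, np.inf
for w in iterates:
    if J(w) < J_best:          # update the pocket only when the risk improves
        w_best, J_best = w, J(w)

print(w_best, J_best)          # the iterate 1.2 attains the smallest risk
```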
Theorem 1 (Network performance)
Consider using the stochastic sub-gradient diffusion algorithm (18) to seek the unique minimizer, $w^{\star}$, of the optimization problem (15), where the risk functions, $J_k(w)$, are assumed to satisfy assumptions (8), (10), and (12) with parameters $\{\eta_k, c_k, d_k, \beta_k^2, \sigma_k^2\}$. Assume the step-size parameter $\mu$ is sufficiently small (see condition (111)). Then, it holds that
The convergence of $\mathbb{E}\,J_k(\boldsymbol{w}^{\rm best}_{k,i})$ towards a neighborhood of size $O(\mu)$ around the optimal value occurs at an exponential rate, $O(\lambda^i)$, dictated by the parameter
\[
\lambda \;=\; 1 - O(\mu).
\]
Condition (111) further ahead ensures $0 < \lambda < 1$.
The argument is provided in Appendix A.
The above theorem clarifies the performance of the network in terms of the best pocket values across the agents. However, these pocket values are not readily available because the risk values, $\mathbb{E}\,J_k(\boldsymbol{w}_{k,i})$, cannot be evaluated. This is due to the fact that the statistical properties of the data are not known beforehand. As was the case with the single-agent scenario in Part I, a more practical conclusion can be deduced from the statement of the theorem as follows. We again introduce the geometric sum:
\[
S_i \;\triangleq\; \sum_{j=0}^{i} \lambda^{i-j} \;=\; \frac{1 - \lambda^{i+1}}{1 - \lambda},
\]
as well as the normalized and convex-combination coefficients:
\[
r_j(i) \;\triangleq\; \frac{\lambda^{i-j}}{S_i}, \qquad j = 0, 1, \ldots, i, \qquad \sum_{j=0}^{i} r_j(i) \;=\; 1.
\]
Using these coefficients, we define a weighted iterate at each agent $k$:
\[
\bar{\boldsymbol{w}}_{k,i} \;\triangleq\; \sum_{j=0}^{i} r_j(i)\,\boldsymbol{w}_{k,j},
\]
and observe that $\bar{\boldsymbol{w}}_{k,i}$ satisfies the recursive construction:
\[
\bar{\boldsymbol{w}}_{k,i} \;=\; \lambda\,\frac{S_{i-1}}{S_i}\,\bar{\boldsymbol{w}}_{k,i-1} \;+\; \frac{1}{S_i}\,\boldsymbol{w}_{k,i}.
\]
In particular, as $i \to \infty$, we have $S_i \to 1/(1-\lambda)$, and the above recursion simplifies in the limit to
\[
\bar{\boldsymbol{w}}_{k,i} \;=\; \lambda\,\bar{\boldsymbol{w}}_{k,i-1} \;+\; (1-\lambda)\,\boldsymbol{w}_{k,i}.
\]
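The equivalence between the geometric weighted average of the iterates and its recursive (exponential-smoothing) construction can be verified numerically. The smoothing factor and the scalar iterate sequence below are illustrative assumptions.

```python
import numpy as np

lam = 0.9                                       # smoothing factor close to one
w = np.random.default_rng(2).normal(size=50)    # stand-in iterate sequence

# Direct weighted combination with geometric coefficients lam**(i-j).
i = len(w) - 1
S = sum(lam ** (i - j) for j in range(i + 1))
w_bar_direct = sum(lam ** (i - j) * w[j] for j in range(i + 1)) / S

# Recursive construction: S_j = lam*S_{j-1} + 1 and
# w_bar_j = lam*(S_{j-1}/S_j)*w_bar_{j-1} + w_j/S_j, started from w_bar_0 = w_0.
w_bar, S_prev = w[0], 1.0
for j in range(1, i + 1):
    S_j = lam * S_prev + 1.0
    w_bar = lam * (S_prev / S_j) * w_bar + w[j] / S_j
    S_prev = S_j

print(abs(w_bar - w_bar_direct))   # the two constructions agree
```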
Corollary 1 (Weighted iterates)
Under the same conditions as in Theorem 1, it holds that
and convergence continues to occur at the same exponential rate, .
The argument is provided in Appendix D.
Result (35) is an interesting conclusion. However, the statement is in terms of the averaged iterate $\bar{\boldsymbol{w}}_{k,i}$, whose computation requires knowledge of $\lambda$. This latter parameter is global information, which is not readily available to all agents. Nevertheless, result (35) motivates the following useful distributed implementation with a similar guaranteed performance bound. We can replace $\lambda$ by a design parameter, $\lambda'$, that is no less than $\lambda$ but still smaller than one, i.e., $\lambda \leq \lambda' < 1$. Next, we introduce the weighted variable:
Corollary 2 (Distributed Weighted iterates)
The argument is similar to the proof of Corollary 2 from Part I .
Diffusion stochastic subgradient with exponential smoothing
repeat for $i \geq 0$:
for each agent $k$:
III-A Interpretation of Results
Examining the bound in (35), and comparing it with result (88) from Part I for the single-agent case, we observe that the topology of the network is now reflected in the bound through the weighting factors $\{q_k\}$ and step-sizes $\{\mu_k\}$, which can be related to the Perron entries $\{p_k\}$ through (25). Recall from (20) that the $\{p_k\}$ are the entries of the right-eigenvector of $A$ corresponding to the eigenvalue at one. Moreover, the bound in (35) involves three terms (rather than only two as in the single-agent case; compare with (88) from Part I):
the first term, which arises from the non-smoothness of the risk function;
the second term, which is due to gradient noise and the approximation of the true sub-gradient vector;
the third term, which is extra in comparison to the single-agent case. We explained in (93) that one of its factors is related to how far the error at each agent is from the weighted average error across the network; the other quantity represents the disagreement among the agents over the minimizer. Because each individual risk function may have a different minimizer, this disagreement is generally nonzero.
Example 1 (Multi-agent LASSO problem) We now consider the LASSO problem with 20 agents connected according to Fig. 2. A quick review of the LASSO problem is as follows (a more detailed discussion, including the relationship between the proposed assumptions (8)–(10) and the LASSO formulation, can be found in Part I). We consider the following cost function for each agent:
\[
J_k(w) \;=\; \mathbb{E}\,\big(\boldsymbol{d}_k - \boldsymbol{u}_k^{\sf T} w\big)^2 \;+\; \delta\,\|w\|_1,
\]
where $\delta > 0$ is a regularization parameter and $\|w\|_1$ denotes the $\ell_1$-norm of $w$. The variable $\boldsymbol{d}_k$ plays the role of a desired signal for agent $k$, while $\boldsymbol{u}_k$ plays the role of a regression vector for the same agent. It is assumed that the regression data are zero-mean wide-sense stationary and follow a standard Gaussian distribution, i.e., $\boldsymbol{u}_k \sim \mathcal{N}(0, I_M)$. We further assume that $\{\boldsymbol{d}_k, \boldsymbol{u}_k\}$ satisfy a linear model generated through:
\[
\boldsymbol{d}_k \;=\; \boldsymbol{u}_k^{\sf T}\, w_k^{o} \;+\; \boldsymbol{v}_k,
\]
where $\boldsymbol{v}_k$ is measurement noise and $w_k^{o}$ is some sparse random model for each agent. Each agent is allowed to have different regression and noise powers, as illustrated in Fig. 3. Under these modeling assumptions, we can determine a closed-form expression for $w^{\star}$ as follows:
From first-order optimality conditions, we obtain:
where the symbol $\mathcal{T}_{\beta}(x)$ represents the (entry-wise) soft-thresholding function with parameter $\beta$, i.e.,
\[
\mathcal{T}_{\beta}(x) \;\triangleq\; \text{sgn}(x)\,\max\{0,\; |x| - \beta\},
\]
where the notation $\text{sgn}(x)$, for a scalar $x$, refers to the sign function:
\[
\text{sgn}(x) \;\triangleq\; \begin{cases} +1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0. \end{cases}
\]
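A direct implementation of the entry-wise soft-thresholding operator, matching the definition above (the input values are illustrative):

```python
import numpy as np

def soft_threshold(x, beta):
    # T_beta(x) = sgn(x) * max(0, |x| - beta), applied entry-wise.
    return np.sign(x) * np.maximum(0.0, np.abs(x) - beta)

# Entries with magnitude below beta are zeroed; the rest shrink toward zero.
print(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.4, 1.5]), 0.5))
```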
For the stochastic sub-gradient implementation, the following instantaneous approximation for the sub-gradient is employed:
\[
\widehat{g_k}(w) \;=\; -2\,\boldsymbol{u}_{k,i}\big(\boldsymbol{d}_k(i) - \boldsymbol{u}_{k,i}^{\sf T} w\big) \;+\; \delta\,\text{sgn}(w).
\]
Diffusion sparse LMS with exponential smoothing
repeat for $i \geq 0$:
for each agent $k$:
The parameter setting is as follows:
Each sparse model $w_k^{o}$ has 5 random non-zero entries uniformly distributed between 0.5 and 1.5. We simply let the weights be uniform and set the same small step-size for all agents. From the simulations we find the value of the factor that appears in the performance bound. As for the exponential smoothing factor, we choose $\lambda'$ close to one.
Example 2 (Multi-agent SVM learning) Next, we consider the multi-agent SVM problem. As in the LASSO example, we provide a brief review of the notation; a more detailed discussion can be found in Part I. The regularized SVM risk function for each agent is of the form:
\[
J_k(w) \;=\; \frac{\rho}{2}\,\|w\|^2 \;+\; \mathbb{E}\,\max\big\{0,\; 1 - \boldsymbol{\gamma}_k\,\boldsymbol{h}_k^{\sf T} w\big\},
\]
where $\rho > 0$ is a regularization parameter. We are generally given a collection of independent training data, $\{\boldsymbol{\gamma}_k(j), \boldsymbol{h}_{k,j}\}$, consisting of feature vectors and their $\pm 1$ class designations, and we select the loss function accordingly.
One approximation for the sub-gradient construction at a generic location $w$, corresponding to generic data $\{\boldsymbol{\gamma}, \boldsymbol{h}\}$, is
\[
\widehat{g}(w) \;=\; \rho\, w \;-\; \boldsymbol{\gamma}\,\boldsymbol{h}\;\mathbb{I}\big[\boldsymbol{\gamma}\,\boldsymbol{h}^{\sf T} w \leq 1\big],
\]
where the indicator function is defined as follows:
\[
\mathbb{I}[a] \;\triangleq\; \begin{cases} 1, & \text{if statement $a$ is true,} \\ 0, & \text{otherwise.} \end{cases}
\]
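The instantaneous SVM sub-gradient approximation can be written as a short function; the regularization value and data below are illustrative assumptions.

```python
import numpy as np

def svm_subgradient(w, h, gamma, rho):
    # Hinge term is active whenever the margin gamma * h^T w is at most 1,
    # in which case the sub-gradient picks up the contribution -gamma * h.
    indicator = 1.0 if gamma * (h @ w) <= 1.0 else 0.0
    return rho * w - gamma * h * indicator

w = np.zeros(3)
h = np.array([1.0, -2.0, 0.5])
g = svm_subgradient(w, h, gamma=1.0, rho=0.1)
print(g)   # with w = 0 the margin is 0 <= 1, so the hinge term is active
```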
Diffusion SVM with exponential smoothing
repeat for $i \geq 0$:
for each agent $k$:
We distribute 32561 training samples from the Adult dataset (https://archive.ics.uci.edu/ml/datasets/Adult) over a network consisting of 20 agents. We set the same regularization parameter and step-size for all agents. From Example 6 in Part I and Theorem 1, we know that for the multi-agent SVM problem:
We set the design parameter $\lambda'$ close to one, which usually guarantees $\lambda' \geq \lambda$. Fig. 5 (left) shows that cooperation among the agents outperforms the non-cooperative solution. Moreover, the distributed network can almost match the performance of the centralized LIBSVM solution. We also examined the RCV1 dataset (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). Here we have 20242 training data points and we distribute them over 20 agents. We adjust the regularization and step-size parameters due to the limited data per agent, and we use a smaller $\lambda'$ since the step-size is not that small. The result is shown in Fig. 5 (right).
In summary, we examined the performance of stochastic sub-gradient learning strategies over adaptive networks. We proposed a new affine-Lipschitz condition, which is well suited to strongly convex but non-differentiable cost functions and is automatically satisfied by several important cases, including SVM, LASSO, and Total-Variation denoising. Under this weaker condition, the analysis establishes that sub-gradient strategies can attain exponential convergence rates, as opposed to sub-linear rates. The analysis also establishes that these strategies can approach the optimal solution to within $O(\mu)$, for sufficiently small step-sizes.
Appendix A Proof of Theorem 1
Introduce the error vector $\widetilde{\boldsymbol{w}}_{k,i} \triangleq w^{\star} - \boldsymbol{w}_{k,i}$. We collect the iterates and the respective errors from across the network into block column vectors:
\[
\boldsymbol{\mathcal{W}}_i \;\triangleq\; {\rm col}\{\boldsymbol{w}_{1,i}, \ldots, \boldsymbol{w}_{N,i}\}, \qquad
\widetilde{\boldsymbol{\mathcal{W}}}_i \;\triangleq\; {\rm col}\{\widetilde{\boldsymbol{w}}_{1,i}, \ldots, \widetilde{\boldsymbol{w}}_{N,i}\}.
\]
We also define the extended quantities:
\[
\mathcal{A} \;\triangleq\; A \otimes I_M, \qquad
\mathcal{M} \;\triangleq\; {\rm diag}\{\mu_1, \ldots, \mu_N\} \otimes I_M, \qquad
\boldsymbol{s}_i \;\triangleq\; {\rm col}\{\boldsymbol{s}_{1,i}, \ldots, \boldsymbol{s}_{N,i}\},
\]
where $\otimes$ denotes the Kronecker product operation, and $\boldsymbol{s}_{k,i}$ denotes the gradient noise at agent $k$. Using this notation, it is straightforward to verify that the network error vector generated by the diffusion strategy (18) evolves according to the following dynamics:
\[
\widetilde{\boldsymbol{\mathcal{W}}}_i \;=\; \mathcal{A}^{\sf T}\Big(\widetilde{\boldsymbol{\mathcal{W}}}_{i-1} \;+\; \mathcal{M}\,g(\boldsymbol{\mathcal{W}}_{i-1}) \;+\; \mathcal{M}\,\boldsymbol{s}_i\Big),
\]
where from (19):
To proceed, we introduce
where the quantities amount to the weighted averages: