I. Introduction and Review of [2]
We briefly review the notation and findings from Part I [2] in preparation for examining the challenges that arise in the multi-agent scenario. In Part I [2], we considered an optimization problem of the form:
(1) $\quad w^\star \;\triangleq\; \arg\min_{w}\; J(w)$

where the possibly non-differentiable but strongly convex risk function $J(w)$ was expressed as the expectation of some convex but also possibly non-differentiable loss function $Q(w;\boldsymbol{x})$, namely,

(2) $\quad J(w) \;\triangleq\; \mathbb{E}\, Q(w;\boldsymbol{x})$
Here, the boldface letter $\boldsymbol{x}$ represents the random data and the expectation operation is over the distribution of this data. The following subgradient algorithm was introduced and studied in Part I [2] for seeking $w^\star$:
(3) $\quad \boldsymbol{w}_i \;=\; \boldsymbol{w}_{i-1} - \mu\, \widehat{g}(\boldsymbol{w}_{i-1})$

(4) $\quad S_i \;=\; 1 + \kappa\, S_{i-1}$

(5) $\quad \bar{\boldsymbol{w}}_i \;=\; \left(1 - \tfrac{1}{S_i}\right)\bar{\boldsymbol{w}}_{i-1} + \tfrac{1}{S_i}\,\boldsymbol{w}_i$
with the initial conditions specified in Part I [2]. Boldface notation is used for $\boldsymbol{w}_i$ to highlight its stochastic nature, since the successive iterates are generated by relying on streaming data realizations for $\boldsymbol{x}$. Moreover, the smoothing scalar $\kappa \in (0,1)$ is a number close to one. The term $\widehat{g}(\boldsymbol{w}_{i-1})$ in (3) is an approximate subgradient at location $\boldsymbol{w}_{i-1}$; it is computed from the data available at time $i$ and approximates a true subgradient denoted by $g(\boldsymbol{w}_{i-1})$. This true subgradient is unavailable since $J(w)$ itself is unavailable in the stochastic context. This is because the distribution of the data $\boldsymbol{x}$ is unknown beforehand, which means that the expected loss function cannot be evaluated. The difference between a true subgradient vector and its approximation is called gradient noise and is denoted by
(6) $\quad \boldsymbol{s}_i(\boldsymbol{w}_{i-1}) \;\triangleq\; \widehat{g}(\boldsymbol{w}_{i-1}) - g(\boldsymbol{w}_{i-1})$
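For intuition, the constant step-size stochastic recursion just reviewed can be sketched in a few lines of code. The snippet below is only an illustration under assumed data: it streams samples from a hypothetical linear regression model and applies update (3) with an instantaneous subgradient of an $\ell_1$-regularized quadratic loss (one of the cases covered in Part I [2]); the model, step-size, and regularization values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu, delta = 5, 0.01, 0.1           # dimension, constant step-size, l1 weight (all assumed)
w_true = rng.normal(size=M)           # hypothetical model generating the streaming data

w = np.zeros(M)                       # iterate w_{-1}
for i in range(2000):
    # stream one realization: d_i = u_i^T w_true + noise
    u = rng.normal(size=M)
    d = u @ w_true + 0.1 * rng.normal()
    # instantaneous subgradient of E(d - u^T w)^2 + delta*||w||_1 evaluated at the sample
    g_hat = -2.0 * u * (d - u @ w) + delta * np.sign(w)
    w = w - mu * g_hat                # constant step-size update, as in (3)

# w approaches the regularized minimizer, which lies close to w_true
print(np.linalg.norm(w - w_true))
```

The run illustrates the behavior discussed below: the iterate converges quickly but hovers in a small neighborhood of the minimizer rather than converging exactly.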
I-A. Data Model and Assumptions
The following three assumptions were motivated in Part I [2]:

1) The risk function $J(w)$ is strongly convex so that the minimizer $w^\star$ is unique. The strong convexity of $J(w)$ means that

(7) $\quad J\big(\theta w_1 + (1-\theta) w_2\big) \;\le\; \theta J(w_1) + (1-\theta) J(w_2) - \tfrac{\nu}{2}\,\theta(1-\theta)\|w_1 - w_2\|^2$

for any $w_1$, $w_2$, and $\theta \in [0,1]$, where $\nu > 0$. The above condition is equivalent to requiring [3]:

(8) $\quad J(w_2) \;\ge\; J(w_1) + g(w_1)^{\sf T}(w_2 - w_1) + \tfrac{\nu}{2}\|w_2 - w_1\|^2$

for any $g(w_1) \in \partial J(w_1)$.
2) The subgradient is affine Lipschitz, meaning that there exist constants $c \ge 0$ and $d \ge 0$ such that

(9) $\quad \|g(w_1) - g'(w_2)\| \;\le\; c\,\|w_1 - w_2\| + d$

for any $g(w_1) \in \partial J(w_1)$, $g'(w_2) \in \partial J(w_2)$, and any $w_1, w_2$. Here, the notation $\partial J(w)$ denotes the subdifferential at location $w$ (i.e., the set of all possible subgradient vectors at $w$). It was explained in Part I [2] how this affine-Lipschitz condition is weaker than conditions used before in the literature and how important cases of interest (such as SVM, LASSO, and total-variation denoising) satisfy it automatically (but do not satisfy the previous conditions). For later use, it is easy to verify (as was done in (50) of Part I [2]) that condition (9) implies that

(10) $\quad \|g(w_1) - g'(w_2)\|^2 \;\le\; e\,\|w_1 - w_2\|^2 + f$

for any $g(w_1) \in \partial J(w_1)$, $g'(w_2) \in \partial J(w_2)$, and the constants $e = 2c^2$ and $f = 2d^2$.

3) The first and second-order moments of the gradient noise process satisfy the conditions:

(11) $\quad \mathbb{E}\left[\boldsymbol{s}_i(\boldsymbol{w}_{i-1}) \,|\, \boldsymbol{\mathcal{F}}_{i-1}\right] \;=\; 0$

(12) $\quad \mathbb{E}\left[\|\boldsymbol{s}_i(\boldsymbol{w}_{i-1})\|^2 \,|\, \boldsymbol{\mathcal{F}}_{i-1}\right] \;\le\; \beta^2\|\boldsymbol{w}_{i-1}\|^2 + \sigma^2$

for some constants $\beta^2 \ge 0$ and $\sigma^2 \ge 0$, and where the notation $\boldsymbol{\mathcal{F}}_{i-1}$ denotes the filtration (collection) corresponding to all past iterates:

(13) $\quad \boldsymbol{\mathcal{F}}_{i-1} \;\triangleq\; \text{filtration}\{\boldsymbol{w}_{-1}, \boldsymbol{w}_0, \ldots, \boldsymbol{w}_{i-1}\}$

It was again shown in Part I [2] how the gradient noise process in important applications (e.g., SVM, LASSO) satisfies (11)–(12) directly.
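As a quick numerical sanity check of the affine-Lipschitz condition (9), consider a hypothetical quadratic-plus-$\ell_1$ risk $J(w) = w^{\sf T} R w + \delta\|w\|_1$, whose subgradients $g(w) = 2Rw + \delta\,{\rm sgn}(w)$ satisfy (9) with $c = 2\|R\|$ and $d = 2\delta\sqrt{M}$ (since each entry of the sign difference is at most 2 in magnitude). All values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
M, delta = 4, 0.1
B = rng.normal(size=(M, M))
R = B @ B.T + np.eye(M)               # positive-definite matrix (hypothetical risk)

def g(w):
    # a subgradient of w^T R w + delta*||w||_1
    return 2 * R @ w + delta * np.sign(w)

c = 2 * np.linalg.norm(R, 2)          # Lipschitz slope
d = 2 * delta * np.sqrt(M)            # offset contributed by the sign term

for _ in range(1000):
    w1, w2 = rng.normal(size=M), rng.normal(size=M)
    lhs = np.linalg.norm(g(w1) - g(w2))
    assert lhs <= c * np.linalg.norm(w1 - w2) + d + 1e-12
print("affine-Lipschitz bound (9) holds on all sampled pairs")
```

Note that a bounded-subgradient assumption would fail for this same risk, since $2Rw$ grows without bound; this is the limitation that condition (9) avoids.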
Under the three conditions 1)–3), which are automatically satisfied in important cases of interest, the following important conclusion was proven in Part I [2] for the stochastic subgradient algorithm (3)–(5) above. At every iteration $i$, it holds that
(14) 
where the convergence of $\mathbb{E}\,J(\boldsymbol{w}^{\rm best}_i)$ to this limiting value occurs at an exponential rate, $O(\lambda^i)$, where $\lambda \in (0,1)$.
I-B. Interpretation of Result
For the benefit of the reader, we repeat here the interpretation that was given in Sec. IV-D of Part I [2] for the key result (14); these remarks will be relevant in the networked case and are therefore useful to highlight again:

First, it has been observed in the optimization literature [4, 3, 5] that subgradient descent iterations can perform poorly in deterministic problems (where $J(w)$ is known). Their convergence rate is $O(1/\sqrt{i})$ under convexity and $O(1/i)$ under strong convexity when decaying step-sizes, $\mu(i) = \tau/i$, are used to ensure convergence [5]. Result (14) shows that the situation is different in the context of stochastic optimization, when true subgradients are approximated from streaming data, due to the different requirements. By using constant step-sizes to enable continuous learning and adaptation, the subgradient iteration is now able to achieve exponential convergence to steady-state at the rate $O(\lambda^i)$.

Second, this substantial improvement in convergence rate comes at a cost, but one that is acceptable and controllable. Specifically, we cannot guarantee convergence of the algorithm to the global minimum value, $J(w^\star)$, anymore, but can instead approach this optimal value with high accuracy on the order of $O(\mu)$, where the size of $\mu$ is under the designer's control and can be selected as small as desired.

Third, this performance level is sufficient in most cases of interest because, in practice, one rarely has an infinite amount of data and, moreover, the data is often subject to distortions not captured by any assumed models. It is increasingly recognized in the literature that it is not always necessary to ensure exact convergence towards the optimal solution, $w^\star$, or the minimum value, $J(w^\star)$, because these optimal values may not reflect accurately the true state due to modeling errors. For example, it is explained in the works [6, 7, 3, 8] that it is generally unnecessary to reduce the error measures below the statistical error level that is present in the data.
I-C. This Work
The purpose of this work is to examine how these properties reveal themselves in the networked case when a multitude of interconnected agents cooperate to minimize an aggregate cost function that is not generally smooth. In this case, it is necessary to examine closely the effect of the coupled dynamics and whether agents will still be able to agree fast enough under nondifferentiability.
Distributed learning under non-smooth risk functions is common in many applications, including distributed estimation and distributed machine learning. For example, $\ell_1$-regularization or hinge-loss functions (as in SVM implementations) lead to non-smooth risks. Several useful techniques have been developed in the literature for the solution of such distributed optimization problems, including the use of consensus strategies [9, 10, 11] and diffusion strategies [12, 13, 14, 15]. In this paper, we focus on the Adapt-then-Combine (ATC) diffusion strategy, mainly because diffusion strategies have been shown to have superior mean-square-error and stability performance in adaptive scenarios where agents are expected to continually learn from streaming data [15]. In particular, we shall examine the performance and stability behavior of networked diffusion learning under weaker conditions than previously considered in the literature. It is true that there have been several useful studies that employed subgradient constructions in the distributed setting before, most notably [16, 9, 17]. However, these earlier works generally assume bounded subgradients. As was already explained in Part I [2], this is a serious limitation (which does not hold even for quadratic risks, where the gradient vector is linear in $w$ and grows unbounded). Instead, we shall consider the weaker affine-Lipschitz condition (9), which was shown in Part I [2] to be satisfied automatically by important risk functions, such as those arising in popular quadratic, SVM, and LASSO formulations.

Notation: We use lowercase letters to denote vectors, uppercase letters for matrices, plain letters for deterministic variables, and boldface letters for random variables. We also use $(\cdot)^{\sf T}$ to denote transposition, $(\cdot)^{-1}$ for matrix inversion, ${\rm Tr}(\cdot)$ for the trace of a matrix, $\lambda(\cdot)$ for the eigenvalues of a matrix, $\|\cdot\|$ for the 2-norm of a matrix or the Euclidean norm of a vector, and $\rho(\cdot)$ for the spectral radius of a matrix. Besides, we use $A \ge 0$ to denote that $A$ is positive semi-definite, and $p \succ 0$ to denote that all entries of vector $p$ are positive.

II. Problem Formulation: Multi-Agent Case
We now extend the single-agent analysis to multi-agent networks, where a collection of $N$ agents cooperate with each other to seek the minimizer of a weighted aggregate cost of the form:
(15) $\quad \mathcal{J}^{\rm glob}(w) \;\triangleq\; \sum_{k=1}^{N} q_k\, J_k(w)$
where $k$ refers to the agent index and $q_k$ is some positive weighting coefficient added for generality. When the $\{q_k\}$ are uniform and equal to each other, then (15) amounts to minimizing the aggregate sum of the individual risks $\{J_k(w)\}$. We can assume, without loss of generality, that the weights are normalized to add up to one:

(16) $\quad \sum_{k=1}^{N} q_k \;=\; 1$
Each individual risk function continues to be expressed as the expected value of some loss function:

(17) $\quad J_k(w) \;\triangleq\; \mathbb{E}\, Q_k(w; \boldsymbol{x}_k)$

Here, the letter $\boldsymbol{x}_k$ represents the random data at agent $k$, and the expectation is over the distribution of this data. Many problems in adaptation and learning involve risk functions of this form, including, for example, mean-square-error designs and support vector machine (SVM) solutions — see, e.g., [18, 19, 20]. We again allow each risk function $J_k(w)$ to be non-differentiable. This situation is common in machine learning formulations, e.g., in SVM costs and in regularized sparsity-inducing formulations. We continue to assume that the individual costs satisfy Assumptions 1 and 2 described in the introduction, namely, conditions (8), (9), and (10), which ensure that each $J_k(w)$ is strongly convex and that its subgradient vectors are affine-Lipschitz with parameters $\{\nu_k, c_k, d_k\}$; we attach a subscript $k$ to these parameters to make them agent-dependent (alternatively, if desired, we can replace them by agent-independent parameters by using bounds on their values).
II-A. Network Model
We consider a network consisting of $N$ separate agents connected by a topology. As described in [21, 12], we assign a pair of nonnegative weights, $\{a_{k\ell}, a_{\ell k}\}$, to the edge connecting any two agents $k$ and $\ell$. The scalar $a_{\ell k}$ is used by agent $k$ to scale the data it receives from agent $\ell$, and similarly for $a_{k\ell}$. The network is said to be connected if paths with nonzero scaling weights can be found linking any two distinct agents in both directions. The network is said to be strongly connected if it is connected with at least one self-loop, meaning that $a_{kk} > 0$ for some agent $k$. Figure 1 shows one example of a strongly connected network. For emphasis in this figure, each edge between two neighboring agents is represented by two directed arrows. The neighborhood of any agent $k$ is denoted by $\mathcal{N}_k$ and consists of all agents that are connected to $k$ by edges; we assume by default that this set includes agent $k$, regardless of whether agent $k$ has a self-loop or not.
There are several strategies that the agents can employ to seek the minimizer, $w^\star$, including consensus and diffusion strategies [11, 9, 10, 21, 12]. As noted earlier, in this work we focus on the latter class, since diffusion implementations have been shown to have superior stability and performance properties over consensus strategies when used in the context of adaptation and learning from streaming data (i.e., when the step-sizes are set to a constant value as opposed to a diminishing value) [21, 12, 15]. Although diminishing step-sizes annihilate the gradient noise term, they nevertheless disable adaptation and learning in the long run. In comparison, constant step-size updates keep adaptation alive, but they allow gradient noise to seep into the operation of the algorithm. The challenge in these scenarios is therefore to show that the dynamics of the diffusion strategy over the network is such that the gradient noise effect does not degrade performance and that the network is still able to learn the unknown. This kind of analysis has been answered before in the affirmative for smooth, twice-differentiable functions — see [21, 12, 13, 14]. In this work, we want to pursue the analysis more generally for possibly non-differentiable risks in order to encompass important applications (such as multi-agent SVM learning, or LASSO and sparsity-aware learning by similar agents [22, 23, 24, 25]). We also want to pursue the analysis under the weaker affine-Lipschitz assumption (9) on the subgradients, rather than the stronger conditions used in the prior literature, as we already explained in the earlier sections and in Part I [2].
II-B. Diffusion Strategy
We consider the following diffusion strategy in its adaptthencombine (ATC) form:
(18) $\quad \begin{cases} \boldsymbol{\psi}_{k,i} \;=\; \boldsymbol{w}_{k,i-1} - \mu_k\, \widehat{g}_k(\boldsymbol{w}_{k,i-1}) \\ \boldsymbol{w}_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i} \end{cases}$
Here, the first step involves adaptation by agent $k$ by using a stochastic subgradient iteration, while the second step involves aggregation; we assume the gradient noise processes across all agents are independent of each other. The entries $\{a_{\ell k}\}$ define a left-stochastic matrix, $A$, namely, the entries of $A$ are nonnegative and each of its columns adds up to one. Since the network is strongly connected, the combination matrix $A$ will be primitive [26, 21]. This implies that $A$ will admit a Jordan decomposition of the form:

(19) $\quad A \;=\; V_{\epsilon}\left[\begin{array}{cc} 1 & 0 \\ 0 & J_{\epsilon} \end{array}\right]V_{\epsilon}^{-1}$
with a single eigenvalue at one and all other eigenvalues strictly inside the unit circle. The matrix $J_{\epsilon}$ has a Jordan structure with the ones that would typically appear along its first sub-diagonal replaced by a small positive number, $\epsilon$. Note that the right- and left-eigenvectors of $A$ corresponding to the eigenvalue at one are denoted by $p$ and $\mathbb{1}$, respectively:

(20) $\quad A p \;=\; p, \qquad \mathbb{1}^{\sf T} A \;=\; \mathbb{1}^{\sf T}$
where $\mathbb{1}$ refers to a column vector with all its entries equal to one. It is further known from the Perron-Frobenius theorem [26] that the entries of $p$ are all strictly positive; we normalize them to add up to one. We denote the individual entries of $p$ by $\{p_k\}$ so that:

(21) $\quad p \;\triangleq\; {\rm col}\{p_1, p_2, \ldots, p_N\}, \qquad p_k > 0$
Furthermore, since , it holds that
(22) 
Next, we introduce the vector of weights

(23) $\quad q \;\triangleq\; {\rm col}\{q_1, q_2, \ldots, q_N\}$

where $q_k$ is the weight associated with $J_k(w)$ in (15). Since the designer is free to select the step-size parameters, it turns out that we can always relate the vectors $\{p, q\}$ in the following manner:

(24) $\quad q \;=\; \beta\,{\rm diag}\{\mu_1, \mu_2, \ldots, \mu_N\}\, p$

for some constant $\beta > 0$. Note, for instance, that for (24) to be valid the scalar $\beta$ should satisfy $\beta = q_k/(\mu_k p_k)$ for all $k$. To make this expression for $\beta$ independent of $k$, we may parameterize (select) the step-sizes as

(25) $\quad \mu_k \;=\; \mu\, \dfrac{q_k}{p_k}$

for some small $\mu > 0$. Then, $\beta = 1/\mu$, which is independent of $k$, and relation (24) is satisfied. Using (16) and (24), it is easy to check that

(26) $\quad \mu \;=\; \sum_{k=1}^{N} p_k\, \mu_k$

Note that since the $\{p_k\}$ are positive, smaller than one, and their sum is one, the above expression shows that $\mu$ can be interpreted as a weighted-average step-size parameter.
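The Perron vector and the step-size selection just described can be illustrated numerically. In the sketch below, the combination matrix, the weights $\{q_k\}$, and $\mu$ are arbitrary choices; the script recovers the Perron vector by power iteration and checks that the $p$-weighted average of the agent step-sizes recovers $\mu$:

```python
import numpy as np

# a small left-stochastic combination matrix (columns sum to one), strongly connected
A = np.array([[0.6, 0.2, 0.1],
              [0.3, 0.5, 0.4],
              [0.1, 0.3, 0.5]])
assert np.allclose(A.sum(axis=0), 1.0)

# Perron eigenvector: A p = p with strictly positive entries summing to one
p = np.ones(3) / 3
for _ in range(200):                  # power iteration converges since A is primitive
    p = A @ p
    p /= p.sum()
assert np.allclose(A @ p, p, atol=1e-10)

# weights q_k (summing to one) and the step-size selection mu_k = mu * q_k / p_k
q = np.array([0.5, 0.25, 0.25])
mu = 0.01
mu_k = mu * q / p

# the p-weighted average of the agent step-sizes recovers mu, since the q_k sum to one
print(p @ mu_k)
```

This mirrors the role of $\mu$ as a weighted-average step-size parameter across the network.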
III. Network Performance
We are now ready to extend Theorem 1 from Part I [2] to the network case. The analysis is more challenging due to the coupling among the agents, but the result will establish that the distributed strategy is stable and converges exponentially fast for sufficiently small step-sizes. As was the case with Part I [2], the statement below is again given in terms of pocket variables, which we define as follows.
At every iteration $i$, the risk value that is attained by iterate $\boldsymbol{w}_{k,i}$ is $J_k(\boldsymbol{w}_{k,i})$. This value is a random variable due to the randomness in the streaming data used to run the algorithm. We denote the mean risk value at agent $k$ by $\mathbb{E}\,J_k(\boldsymbol{w}_{k,i})$. We again introduce a best pocket iterate, denoted by $\boldsymbol{w}^{\rm best}_{k,i}$. At any iteration $i$, the value that is saved in this pocket variable is the iterate that has generated the smallest mean risk value up to time $i$, i.e.,

(27) $\quad \boldsymbol{w}^{\rm best}_{k,i} \;\triangleq\; \underset{\boldsymbol{w}_{k,j},\; 0 \le j \le i}{\arg\min}\;\; \mathbb{E}\, J_k(\boldsymbol{w}_{k,j})$
Observe that in the network case we now have $N$ pocket values, one for each agent.
Theorem 1 (Network performance)
Consider using the stochastic subgradient diffusion algorithm (18) to seek the unique minimizer, $w^\star$, of the optimization problem (15), where the risk functions, $J_k(w)$, are assumed to satisfy assumptions (8), (10), and (12) with the corresponding agent-dependent parameters. Assume the step-size parameter $\mu$ is sufficiently small (see condition (111)). Then, it holds that
The convergence of $\mathbb{E}\,J_k(\boldsymbol{w}^{\rm best}_{k,i})$ towards a neighborhood of size $O(\mu)$ around the optimal value occurs at an exponential rate, $O(\lambda^i)$, dictated by the parameter
(29) 
Condition (111), given further ahead, ensures $\lambda \in (0,1)$.
Proof: The argument is provided in Appendix A.
The above theorem clarifies the performance of the network in terms of the best pocket values across the agents. However, these pocket values are not readily available because the risk values, $\mathbb{E}\,J_k(\boldsymbol{w}_{k,i})$, cannot be evaluated. This is due to the fact that the statistical properties of the data are not known beforehand. As was the case with the single-agent scenario in Part I [2], a more practical conclusion can be deduced from the statement of the theorem as follows. We again introduce the geometric sum:

(30) $\quad S_i \;\triangleq\; \sum_{j=0}^{i} \lambda^{\,i-j} \;=\; \dfrac{1-\lambda^{\,i+1}}{1-\lambda}$

as well as the normalized and convex-combination coefficients:

(31) $\quad r_j(i) \;\triangleq\; \dfrac{\lambda^{\,i-j}}{S_i}, \qquad j = 0, 1, \ldots, i$

Using these coefficients, we define a weighted iterate at each agent:

(32) $\quad \bar{\boldsymbol{w}}_{k,i} \;\triangleq\; \sum_{j=0}^{i} r_j(i)\, \boldsymbol{w}_{k,j}$

and observe that $\bar{\boldsymbol{w}}_{k,i}$ satisfies the recursive construction:

(33) $\quad \bar{\boldsymbol{w}}_{k,i} \;=\; \left(1 - \dfrac{1}{S_i}\right)\bar{\boldsymbol{w}}_{k,i-1} + \dfrac{1}{S_i}\,\boldsymbol{w}_{k,i}$

In particular, as $i \to \infty$, we have $S_i \to (1-\lambda)^{-1}$, and the above recursion simplifies in the limit to

(34) $\quad \bar{\boldsymbol{w}}_{k,i} \;=\; \lambda\, \bar{\boldsymbol{w}}_{k,i-1} + (1-\lambda)\, \boldsymbol{w}_{k,i}$
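The equivalence between a geometrically weighted running average and its one-line recursion, as well as the limiting behavior of the geometric sum, can be verified numerically. In the sketch below, the smoothing factor and the scalar iterate sequence are arbitrary; the direct average with weights proportional to $\lambda^{i-j}$ is compared against the recursive form at every step:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.9                             # smoothing factor (arbitrary, in (0,1))
w = rng.normal(size=50)               # an arbitrary scalar iterate sequence w_0..w_49

# direct weighted average: wbar_i = sum_j lam^(i-j) w_j / S_i, with S_i = sum_j lam^(i-j)
def direct(i):
    weights = lam ** (i - np.arange(i + 1))
    return weights @ w[: i + 1] / weights.sum()

# equivalent running recursion: S_i = 1 + lam*S_{i-1};
# wbar_i = (1 - 1/S_i)*wbar_{i-1} + (1/S_i)*w_i
S, wbar = 0.0, 0.0
for i in range(50):
    S = 1.0 + lam * S
    wbar = (1.0 - 1.0 / S) * wbar + w[i] / S
    assert np.isclose(wbar, direct(i))

# for large i, S_i tends to 1/(1-lam), so the update tends to
# wbar_i = lam*wbar_{i-1} + (1-lam)*w_i
print(S, 1.0 / (1.0 - lam))
```

The printout shows the geometric sum already close to its limit $1/(1-\lambda) = 10$ after 50 steps, which is why the limiting update is an accurate substitute in steady-state.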
Corollary 1 (Weighted iterates)
Under the same conditions as in Theorem 1, it holds that
(35) 
and convergence continues to occur at the same exponential rate, $O(\lambda^i)$.
Proof: The argument is provided in Appendix D.
Result (35) is an interesting conclusion. However, the statement is in terms of the averaged iterate $\bar{\boldsymbol{w}}_{k,i}$, whose computation requires knowledge of $\lambda$. This latter parameter is global information, which is not readily available to all agents. Nevertheless, result (35) motivates the following useful distributed implementation with a similar guaranteed performance bound. We can replace $\lambda$ by a design parameter, $\kappa$, that is no less than $\lambda$ but still smaller than one, i.e., $\lambda \le \kappa < 1$. Next, we introduce the weighted variable:
(36) $\quad \bar{\boldsymbol{w}}^{\kappa}_{k,i} \;\triangleq\; \sum_{j=0}^{i} r^{\kappa}_j(i)\, \boldsymbol{w}_{k,j}$

where now

(37) $\quad r^{\kappa}_j(i) \;\triangleq\; \dfrac{\kappa^{\,i-j}}{S^{\kappa}_i}$

and

(38) $\quad S^{\kappa}_i \;\triangleq\; \sum_{j=0}^{i} \kappa^{\,i-j} \;=\; \dfrac{1-\kappa^{\,i+1}}{1-\kappa}$
Corollary 2 (Distributed Weighted iterates)
Proof: The argument is similar to the proof of Corollary 2 from Part I [2].
For ease of reference, we summarize in the table below the
listing of the stochastic subgradient learning algorithm with
exponential smoothing for which Corollaries 1 and 2 hold.
Diffusion stochastic subgradient with exponential smoothing
Initialization: $\boldsymbol{w}_{k,-1} = 0$, $\bar{\boldsymbol{w}}^{\kappa}_{k,-1} = 0$, $S^{\kappa}_{-1} = 0$.
repeat for $i \ge 0$:
for each agent $k = 1, 2, \ldots, N$:
(39) $\quad \boldsymbol{\psi}_{k,i} \;=\; \boldsymbol{w}_{k,i-1} - \mu_k\, \widehat{g}_k(\boldsymbol{w}_{k,i-1})$
(40) $\quad \boldsymbol{w}_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i}$
(41) $\quad S^{\kappa}_i \;=\; 1 + \kappa\, S^{\kappa}_{i-1}$
(42) $\quad \bar{\boldsymbol{w}}^{\kappa}_{k,i} \;=\; \left(1 - \dfrac{1}{S^{\kappa}_i}\right)\bar{\boldsymbol{w}}^{\kappa}_{k,i-1} + \dfrac{1}{S^{\kappa}_i}\,\boldsymbol{w}_{k,i}$
end
end
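As an illustration only, the listing above can be prototyped as follows. The agents below minimize hypothetical $\ell_1$-regularized least-squares risks over a small assumed topology; the combination matrix, data model, step-size, regularization, and smoothing values are all arbitrary choices and not part of the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 3, 4                                   # agents, model dimension (assumed)
mu, delta, kappa = 0.01, 0.05, 0.95           # step-size, l1 weight, smoothing (assumed)
w_o = rng.normal(size=M)                      # common model generating each agent's data

# left-stochastic combination matrix (columns sum to one), strongly connected
A = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])

W = np.zeros((N, M))                          # iterates w_{k,-1}
W_bar = np.zeros((N, M))                      # smoothed iterates
S = 0.0                                       # geometric sum
for i in range(3000):
    # adaptation: each agent takes an instantaneous subgradient step on its own risk
    Psi = np.empty_like(W)
    for k in range(N):
        u = rng.normal(size=M)
        d = u @ w_o + 0.1 * rng.normal()
        g_hat = -2 * u * (d - u @ W[k]) + delta * np.sign(W[k])
        Psi[k] = W[k] - mu * g_hat
    # combination: w_{k,i} = sum_l a_{lk} psi_{l,i}
    W = A.T @ Psi
    # exponential smoothing of the iterates
    S = 1.0 + kappa * S
    W_bar = (1.0 - 1.0 / S) * W_bar + W / S

# each agent's smoothed iterate ends up near the (regularized) minimizer
print(np.linalg.norm(W_bar - w_o, axis=1))
```

Note the two-step ATC structure: the stochastic subgradient step is applied first, and only the intermediate iterates are exchanged and combined through the left-stochastic matrix.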
III-A. Interpretation of Results
Examining the bound in (35), and comparing it with result (88) from Part I [2] for the single-agent case, we observe that the topology of the network is now reflected in the bound through the weighting factors $\{p_k\}$ and the step-size $\mu$, which can be related to the Perron entries through (25). Recall from (20) that the $\{p_k\}$ are the entries of the right-eigenvector of $A$ corresponding to the eigenvalue at one. Moreover, the bound in (35) involves three terms (rather than only two as in the single-agent case — compare with (88) from Part I [2]):

a first term, which arises from the non-smoothness of the risk function;

a second term, which is due to gradient noise and the approximation of the true subgradient vector;

a third term, which is extra in comparison to the single-agent case. We explained in (93) that its value is related to how far the error at each agent is from the weighted average error across the network; it represents the disagreement among the agents over $w^\star$. Because each function $J_k(w)$ may have a different minimizer, this term is generally nonzero.
IV. Simulations
Example 1 (Multi-agent LASSO problem). We now consider the LASSO problem with 20 agents connected according to Fig. 2. A quick review of the LASSO problem is as follows. (A more detailed discussion, including the relationship between the proposed assumptions (8)–(10) and the LASSO formulation, can be found in Part I [2].) We consider the following cost function for each agent:

(43) $\quad J_k(w) \;=\; \mathbb{E}\left(\boldsymbol{d}_k - \boldsymbol{u}_k w\right)^2 + \delta\|w\|_1$
where $\delta > 0$ is a regularization parameter and $\|w\|_1$ denotes the $\ell_1$-norm of $w$. The variable $\boldsymbol{d}_k$ plays the role of a desired signal for agent $k$, while $\boldsymbol{u}_k$ plays the role of a regression vector for the same agent. It is assumed that the regression data are zero-mean wide-sense stationary and Gaussian-distributed. We further assume that the data $\{\boldsymbol{d}_k, \boldsymbol{u}_k\}$ satisfy a linear model generated through:

(44) $\quad \boldsymbol{d}_k \;=\; \boldsymbol{u}_k w^{o} + \boldsymbol{v}_k$

where $\boldsymbol{v}_k$ denotes measurement noise and $w^{o}$ is some sparse random model for each agent. Each agent is allowed to have different regression and noise powers, as illustrated in Fig. 3. Under these modeling assumptions, we can determine a closed-form expression for $w^\star$ as follows:
(45) 
From first-order optimality conditions, we obtain [27]:
(46) 
where the symbol $\mathcal{T}_{\beta}(\cdot)$ represents the soft-thresholding function with parameter $\beta > 0$, applied entry-wise, i.e.,

(47) $\quad \mathcal{T}_{\beta}(x) \;\triangleq\; {\rm col}\left\{\mathcal{T}_{\beta}(x_1),\, \mathcal{T}_{\beta}(x_2),\, \ldots,\, \mathcal{T}_{\beta}(x_M)\right\}$

and

(48) $\quad \mathcal{T}_{\beta}(x_m) \;\triangleq\; {\rm sgn}(x_m)\,\max\left\{|x_m| - \beta,\; 0\right\}$

where the notation ${\rm sgn}(x_m)$, for a scalar $x_m$, refers to the sign function:

(49) $\quad {\rm sgn}(x_m) \;\triangleq\; \begin{cases} +1, & x_m > 0 \\ 0, & x_m = 0 \\ -1, & x_m < 0 \end{cases}$
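The entry-wise soft-thresholding operator is straightforward to implement in vectorized form; a minimal sketch:

```python
import numpy as np

def soft_threshold(x, beta):
    """Entry-wise soft-thresholding: sgn(x_m) * max(|x_m| - beta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - beta, 0.0)

x = np.array([-1.5, -0.05, 0.0, 0.2, 2.0])
# entries with magnitude below beta are zeroed; the rest shrink toward zero by beta
print(soft_threshold(x, 0.1))
```

The operator both sparsifies (small entries are set exactly to zero) and shrinks (large entries move toward zero), which is the mechanism behind the LASSO solution above.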
For the stochastic subgradient implementation, the following instantaneous approximation for the subgradient is employed:

(50) $\quad \widehat{g}_k(w) \;=\; -2\,\boldsymbol{u}_k^{\sf T}\left(\boldsymbol{d}_k - \boldsymbol{u}_k w\right) + \delta\,{\rm sgn}(w)$
In Fig. 4, we compare the performance of this solution against several strategies, including standard diffusion LMS [28, 21, 12]:
(51) $\quad \begin{cases} \boldsymbol{\psi}_{k,i} \;=\; \boldsymbol{w}_{k,i-1} + 2\mu_k\, \boldsymbol{u}_{k,i}^{\sf T}\left(\boldsymbol{d}_k(i) - \boldsymbol{u}_{k,i}\boldsymbol{w}_{k,i-1}\right) \\ \boldsymbol{w}_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i} \end{cases}$
and sparse diffusion LMS [22, 24, 25], [23, Eq. 21].
Diffusion sparse LMS with exponential smoothing
Initialization: $\boldsymbol{w}_{k,-1} = 0$, $\bar{\boldsymbol{w}}^{\kappa}_{k,-1} = 0$, $S^{\kappa}_{-1} = 0$.
repeat for $i \ge 0$:
for each agent $k = 1, 2, \ldots, N$:
(52) $\quad \boldsymbol{\psi}_{k,i} \;=\; \boldsymbol{w}_{k,i-1} + \mu_k\left[2\,\boldsymbol{u}_{k,i}^{\sf T}\left(\boldsymbol{d}_k(i) - \boldsymbol{u}_{k,i}\boldsymbol{w}_{k,i-1}\right) - \delta\,{\rm sgn}(\boldsymbol{w}_{k,i-1})\right]$
(53) $\quad \boldsymbol{w}_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i}$
(54) $\quad S^{\kappa}_i \;=\; 1 + \kappa\, S^{\kappa}_{i-1}$
(55) $\quad \bar{\boldsymbol{w}}^{\kappa}_{k,i} \;=\; \left(1 - \dfrac{1}{S^{\kappa}_i}\right)\bar{\boldsymbol{w}}^{\kappa}_{k,i-1} + \dfrac{1}{S^{\kappa}_i}\,\boldsymbol{w}_{k,i}$
end
end
The parameter setting is as follows: $w^{o}$ has 5 random nonzero entries uniformly distributed between 0.5 and 1.5. We let all agents employ the same step-size. From the simulations, we determine the value of the factor that appears in the performance bound of Theorem 1. As for the exponential smoothing factor $\kappa$, we choose a value close to one.
Example 2 (Multi-agent SVM learning). Next, we consider the multi-agent SVM problem. As with the LASSO problem, we provide a brief review of the notation; a more detailed discussion can be found in Part I [2]. The regularized SVM risk function for each agent is of the form:

(56) $\quad J_k(w) \;=\; \dfrac{\rho}{2}\|w\|^2 + \mathbb{E}\left\{\max\left(0,\; 1 - \boldsymbol{\gamma}_k\,\boldsymbol{h}_k^{\sf T} w\right)\right\}$

where $\rho > 0$ is a regularization parameter. We are generally given a collection of independent training data, consisting of feature vectors $\boldsymbol{h}_k \in \mathbb{R}^M$ and their class designations $\boldsymbol{\gamma}_k \in \{+1, -1\}$. We select the loss function as

(57) $\quad Q_k(w; \boldsymbol{\gamma}_k, \boldsymbol{h}_k) \;=\; \dfrac{\rho}{2}\|w\|^2 + \max\left(0,\; 1 - \boldsymbol{\gamma}_k\,\boldsymbol{h}_k^{\sf T} w\right)$
One approximation for the subgradient construction at a generic location $w$, corresponding to generic data $\{\gamma, h\}$, is

(58) $\quad \widehat{g}(w) \;=\; \rho\, w - \gamma\, h\; \mathbb{I}\left[\gamma\, h^{\sf T} w \le 1\right]$

where the indicator function is defined as follows:

(59) $\quad \mathbb{I}[a] \;\triangleq\; \begin{cases} 1, & \text{if statement } a \text{ is true} \\ 0, & \text{otherwise} \end{cases}$
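The hinge-loss subgradient approximation in (58)–(59) can be coded directly. In the sketch below, the feature vector, label, and parameter values are hypothetical:

```python
import numpy as np

def svm_subgradient(w, gamma, h, rho):
    """Instantaneous subgradient of rho/2*||w||^2 + max(0, 1 - gamma*h'w), as in (58)."""
    active = 1.0 if gamma * (h @ w) <= 1.0 else 0.0   # indicator I[gamma*h'w <= 1]
    return rho * w - gamma * h * active

# a single stochastic subgradient step at w = 0 (all values hypothetical)
rng = np.random.default_rng(4)
w = np.zeros(3)
h, gamma, rho, mu = rng.normal(size=3), 1.0, 0.1, 0.05
w_next = w - mu * svm_subgradient(w, gamma, h, rho)
# at w = 0 the margin constraint is active, so the step moves in the direction gamma*h
print(w_next)
```

Note that the indicator makes the update piecewise: samples that already satisfy the margin only contribute the regularization pull $\rho w$, while violating samples also push $w$ toward classifying them correctly.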
Diffusion SVM with exponential smoothing
Initialization: $\boldsymbol{w}_{k,-1} = 0$, $\bar{\boldsymbol{w}}^{\kappa}_{k,-1} = 0$, $S^{\kappa}_{-1} = 0$.
repeat for $i \ge 0$:
for each agent $k = 1, 2, \ldots, N$:
(60) $\quad \boldsymbol{\psi}_{k,i} \;=\; (1 - \rho\mu_k)\,\boldsymbol{w}_{k,i-1} + \mu_k\, \boldsymbol{\gamma}_k(i)\, \boldsymbol{h}_{k,i}\; \mathbb{I}\left[\boldsymbol{\gamma}_k(i)\, \boldsymbol{h}_{k,i}^{\sf T}\boldsymbol{w}_{k,i-1} \le 1\right]$
(61) $\quad \boldsymbol{w}_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i}$
(62) $\quad S^{\kappa}_i \;=\; 1 + \kappa\, S^{\kappa}_{i-1}$
(63) $\quad \bar{\boldsymbol{w}}^{\kappa}_{k,i} \;=\; \left(1 - \dfrac{1}{S^{\kappa}_i}\right)\bar{\boldsymbol{w}}^{\kappa}_{k,i-1} + \dfrac{1}{S^{\kappa}_i}\,\boldsymbol{w}_{k,i}$
end
end
We distribute 32,561 training samples from the UCI Adult dataset (https://archive.ics.uci.edu/ml/datasets/Adult) over a network consisting of 20 agents. We set the parameters $\rho$ and $\mu_k$ to the same values for all agents. From Example 6 in Part I [2] and Theorem 1, we know that for the multi-agent SVM problem:
(64)  
We set $\kappa$ close to one, which usually guarantees $\kappa \ge \lambda$. Fig. 5 (left) shows that cooperation among the agents outperforms the non-cooperative solution. Moreover, the distributed network can almost match the performance of the centralized LIBSVM solution [29]. We also examined the RCV1 dataset (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). Here we have 20,242 training data points, which we distribute over 20 agents. We adjust the parameters in this case (due to the limited data) and select $\kappa$ accordingly, since the step-size is not as small. The result is shown in Fig. 5 (right).
V. Conclusion
In summary, we examined the performance of stochastic subgradient learning strategies over adaptive networks. We adopted an affine-Lipschitz condition, which is quite suitable for strongly convex but non-differentiable cost functions and is automatically satisfied in several important cases, including SVM, LASSO, and total-variation denoising. Under this weaker condition, the analysis establishes that subgradient strategies can attain exponential convergence rates, as opposed to sublinear rates. The analysis also establishes that these strategies can approach the optimal solution within $O(\mu)$, for sufficiently small step-sizes.
Appendix A: Proof of Theorem 1
Introduce the error vectors $\widetilde{\boldsymbol{w}}_{k,i} \triangleq w^\star - \boldsymbol{w}_{k,i}$. We collect the iterates and the respective errors from across the network into block column vectors:

(65) $\quad \boldsymbol{\mathcal{W}}_i \;\triangleq\; {\rm col}\left\{\boldsymbol{w}_{1,i},\, \boldsymbol{w}_{2,i},\, \ldots,\, \boldsymbol{w}_{N,i}\right\}$

(66) $\quad \widetilde{\boldsymbol{\mathcal{W}}}_i \;\triangleq\; {\rm col}\left\{\widetilde{\boldsymbol{w}}_{1,i},\, \widetilde{\boldsymbol{w}}_{2,i},\, \ldots,\, \widetilde{\boldsymbol{w}}_{N,i}\right\}$
We also define the extended quantities:

(67) $\quad \mathcal{A} \;\triangleq\; A \otimes I_M$

(68) $\quad \mathcal{M} \;\triangleq\; {\rm diag}\{\mu_1, \mu_2, \ldots, \mu_N\} \otimes I_M$

(69) $\quad \mathcal{G}(\boldsymbol{\mathcal{W}}_{i-1}) \;\triangleq\; {\rm col}\left\{g_1(\boldsymbol{w}_{1,i-1}),\, \ldots,\, g_N(\boldsymbol{w}_{N,i-1})\right\}$

(70) $\quad \widehat{\mathcal{G}}(\boldsymbol{\mathcal{W}}_{i-1}) \;\triangleq\; {\rm col}\left\{\widehat{g}_1(\boldsymbol{w}_{1,i-1}),\, \ldots,\, \widehat{g}_N(\boldsymbol{w}_{N,i-1})\right\}$

(71) $\quad \boldsymbol{s}_i \;\triangleq\; {\rm col}\left\{\boldsymbol{s}_{1,i},\, \boldsymbol{s}_{2,i},\, \ldots,\, \boldsymbol{s}_{N,i}\right\}$

where $\otimes$ denotes the Kronecker product operation, and $\boldsymbol{s}_{k,i}$ denotes the gradient noise at agent $k$. Using this notation, it is straightforward to verify that the network error vector generated by the diffusion strategy (18) evolves according to the following dynamics:

(72) $\quad \widetilde{\boldsymbol{\mathcal{W}}}_i \;=\; \mathcal{A}^{\sf T}\left[\widetilde{\boldsymbol{\mathcal{W}}}_{i-1} + \mathcal{M}\left(\mathcal{G}(\boldsymbol{\mathcal{W}}_{i-1}) + \boldsymbol{s}_i\right)\right]$
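The block error recursion (72) follows from the per-agent updates (18) because each column of $A$ sums to one, so that $w^\star = \sum_{\ell} a_{\ell k} w^\star$. This stacking step can be sanity-checked numerically; the matrices, step-sizes, and subgradient values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 3, 2
A = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])          # left-stochastic: each column sums to one
mu = np.array([0.01, 0.02, 0.015])
w_star = rng.normal(size=M)

W_prev = rng.normal(size=(N, M))         # arbitrary iterates w_{k,i-1}
G_hat = rng.normal(size=(N, M))          # arbitrary subgradient approximations

# per-agent diffusion update (18): psi_k = w_k - mu_k*g_k ; w_k = sum_l a_{lk} psi_l
Psi = W_prev - mu[:, None] * G_hat
W_new = A.T @ Psi

# stacked error recursion: tilde_W_i = (A^T kron I_M)[tilde_W_{i-1} + M_ext*g_hat]
A_ext = np.kron(A.T, np.eye(M))
M_ext = np.kron(np.diag(mu), np.eye(M))
tilde_prev = (w_star - W_prev).reshape(-1)
tilde_new = A_ext @ (tilde_prev + M_ext @ G_hat.reshape(-1))

assert np.allclose(tilde_new, (w_star - W_new).reshape(-1))
print("stacked error recursion matches the per-agent updates")
```

The same check goes through with the noisy subgradients split as $\widehat{\mathcal{G}} = \mathcal{G} + \boldsymbol{s}_i$, which is how the gradient noise enters the error dynamics.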
Motivated by the treatment of the smooth case in [21, 13, 14], we introduce a useful change of variables. Let and . Multiplying (72) from the left by gives
(73) 
where from (19):
(74) 
and
(75) 
To proceed, we introduce
(77)  
(78) 
where the quantities amount to the weighted averages:
(79)  
(80)  
(81) 
It is useful to observe the asymmetry reflected in the fact that is obtained by using the weights while the averages (77)–(78) are obtained by using the weights . We can now rewrite (73) as