Performance Limits of Stochastic Sub-Gradient Learning, Part II: Multi-Agent Case

04/20/2017 ∙ by Bicheng Ying, et al.

The analysis in Part I revealed interesting properties for subgradient learning algorithms in the context of stochastic optimization when gradient noise is present. These algorithms are used when the risk functions are non-smooth and involve non-differentiable components. They have been long recognized as being slow converging methods. However, it was revealed in Part I that the rate of convergence becomes linear for stochastic optimization problems, with the error iterate converging at an exponential rate α^i to within an O(μ)-neighborhood of the optimizer, for some α∈ (0,1) and small step-size μ. The conclusion was established under weaker assumptions than the prior literature and, moreover, several important problems (such as LASSO, SVM, and Total Variation) were shown to satisfy these weaker assumptions automatically (but not the previously used conditions from the literature). These results revealed that sub-gradient learning methods have more favorable behavior than originally thought when used to enable continuous adaptation and learning. The results of Part I were exclusive to single-agent adaptation. The purpose of the current Part II is to examine the implications of these discoveries when a collection of networked agents employs subgradient learning as their cooperative mechanism. The analysis will show that, despite the coupled dynamics that arises in a networked scenario, the agents are still able to attain linear convergence in the stochastic case; they are also able to reach agreement within O(μ) of the optimizer.


I Introduction and Review of [2]

We review briefly the notation and findings from Part I [2] in preparation for examining the challenges that arise in the multi-agent scenario. In Part I[2], we considered an optimization problem of the form:

(1)

where the possibly non-differentiable but strongly-convex risk function

was expressed as the expectation of some convex but also possibly non-differentiable loss function

, namely,

(2)

Here, the letter represents the random data and the expectation operation is over the distribution of this data. The following sub-gradient algorithm was introduced and studied in Part I[2] for seeking :

(3)
(4)
(5)

with initial conditions , , and . Boldface notation is used for to highlight its stochastic nature since the successive iterates are generated by relying on streaming data realizations for . Moreover, the scalar , where is a number close to one. The term in (3) is an approximate sub-gradient at location ; it is computed from the data available at time and approximates a true sub-gradient denoted by . This true sub-gradient is unavailable since itself is unavailable in the stochastic context. This is because the distribution of the data is unknown beforehand, which means that the expected loss function cannot be evaluated. The difference between a true sub-gradient vector and its approximation is called gradient noise and is denoted by

(6)
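As a rough illustration of a recursion of the type (3)–(5), the sketch below runs a constant step-size stochastic sub-gradient iteration followed by exponential smoothing on a scalar toy problem. Everything specific here is an assumption made for illustration and is not taken from Part I: the form of the three updates (iterate, geometric sum, smoothed iterate), the toy loss Q(w; x) = δ|w| + ½(w − x)² with Gaussian data x of mean one, the smoothing factor `kappa`, and all parameter values. For this toy risk the minimizer is 0.5 when δ = 0.5.

```python
import random


def sign(x):
    """Sign function with sign(0) = 0."""
    return (x > 0) - (x < 0)


def smoothed_subgradient(mu=0.01, kappa=0.99, delta=0.5, sigma=0.3,
                         num_iter=3000, seed=0):
    """Constant step-size stochastic sub-gradient run with smoothing.

    Assumed toy loss: Q(w; x) = delta*|w| + 0.5*(w - x)^2, x ~ N(1, sigma^2).
    """
    rng = random.Random(seed)
    w = 0.0        # current iterate
    S = 0.0        # geometric sum of the smoothing weights
    w_bar = 0.0    # exponentially smoothed iterate
    for _ in range(num_iter):
        x = rng.gauss(1.0, sigma)             # streaming data sample
        g_hat = delta * sign(w) + (w - x)     # approximate sub-gradient
        w = w - mu * g_hat                    # sub-gradient step
        S = kappa * S + 1.0                   # update the geometric sum
        w_bar = (1.0 - 1.0 / S) * w_bar + w / S
    return w, w_bar


w_last, w_smooth = smoothed_subgradient()
w_opt = 0.5  # minimizer of delta*|w| + 0.5*(w - 1)^2 for delta = 0.5
print(abs(w_last - w_opt), abs(w_smooth - w_opt))
```

With a constant step-size the iterate does not converge exactly; it keeps fluctuating in a small neighborhood of the minimizer, which is the behavior the text attributes to this class of algorithms.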

I-A Data Model and Assumptions

The following three assumptions were motivated in Part I[2]:

  1. is strongly-convex so that is unique. The strong convexity of means that

    (7)

    for any , and . The above condition is equivalent to requiring [3]:

    (8)
  2. The subgradient is affine Lipschitz, meaning that there exist constants and such that

    (9)

    and for any . Here, the notation denotes the differential at location (i.e., the set of all possible subgradient vectors at ). It was explained in Part I [2] how this affine Lipschitz condition is weaker than conditions used before in the literature and how important cases of interest (such as SVM, LASSO, Total Variation) satisfy it automatically (but do not satisfy the previous conditions). For later use, it is easy to verify (as was done in (50) in Part I[2]) that condition (9) implies that

    (10)

    for any and some constants and .

  3. The first and second-order moments of the gradient noise process satisfy the conditions:

    (11)
    (12)

    for some constants and , and where the notation denotes the filtration (collection) corresponding to all past iterates:

    (13)

    It was again shown in Part I [2] how the gradient noise process in important applications (e.g., SVM, LASSO) satisfies (11)–(12) directly.
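To make the affine-Lipschitz idea concrete, the following sketch checks numerically that a sub-gradient of a LASSO-type risk satisfies a bound of the form ‖g(x) − g(y)‖ ≤ c‖x − y‖ + d, while no uniform bound on ‖g(w)‖ can hold. The exact form of condition (9) is not visible in this text, so the inequality used here is an assumed reading of it; the toy risk J(w) = δ‖w‖₁ + ½‖w − a‖², the constants c = 1 and d = 2δ√M, and all names are illustrative choices.

```python
import math
import random


def sign(x):
    return (x > 0) - (x < 0)


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def subgrad(w, a, delta):
    """One sub-gradient of J(w) = delta*||w||_1 + 0.5*||w - a||^2."""
    return [delta * sign(wi) + (wi - ai) for wi, ai in zip(w, a)]


rng = random.Random(1)
M, delta = 5, 0.3
a = [rng.uniform(-1.0, 1.0) for _ in range(M)]

# Candidate affine-Lipschitz constants: the smooth part contributes c = 1,
# and the sign terms differ by at most 2 per entry, giving d = 2*delta*sqrt(M).
c, d = 1.0, 2.0 * delta * math.sqrt(M)

violations = 0
for _ in range(2000):
    x = [rng.uniform(-3.0, 3.0) for _ in range(M)]
    y = [rng.uniform(-3.0, 3.0) for _ in range(M)]
    gx, gy = subgrad(x, a, delta), subgrad(y, a, delta)
    lhs = norm([u - v for u, v in zip(gx, gy)])
    rhs = c * norm([u - v for u, v in zip(x, y)]) + d
    if lhs > rhs + 1e-12:
        violations += 1

# The sub-gradient itself is not bounded: it grows with ||w||.
far_norm = norm(subgrad([1000.0] * M, a, delta))
print(violations, far_norm)
```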

Under the three conditions 1)–3), which are automatically satisfied for important cases of interest, the following important conclusion was proven in Part I [2] for the stochastic subgradient algorithm (3)–(5) above. At every iteration , it will hold that

(14)

where the convergence of to occurs at an exponential rate where .

I-B Interpretation of Result

For the benefit of the reader, we repeat here the interpretation that was given in Sec. IV.D of Part I [2] for the key result (14); these remarks will be relevant in the networked case and are therefore useful to highlight again:

  1. First, it has been observed in the optimization literature [4, 3, 5] that sub-gradient descent iterations can perform poorly in deterministic problems (where is known). Their convergence rate is under convexity and under strong-convexity when decaying step-sizes, , are used to ensure convergence [5]. Result (14) shows that the situation is different in the context of stochastic optimization, where true subgradients are approximated from streaming data and the requirements are different. By using constant step-sizes to enable continuous learning and adaptation, the sub-gradient iteration is now able to achieve exponential convergence to steady-state at the rate α^i, for some α ∈ (0,1).

  2. Second, this substantial improvement in convergence rate comes at a cost, but one that is acceptable and controllable. Specifically, we cannot guarantee convergence of the algorithm to the global minimum value, , anymore but can instead approach this optimal value with high accuracy in the order of , where the size of is under the designer’s control and can be selected as small as desired.

  3. Third, this performance level is sufficient in most cases of interest because, in practice, one rarely has an infinite amount of data and, moreover, the data is often subject to distortions not captured by any assumed models. It is increasingly recognized in the literature that it is not always necessary to ensure exact convergence towards the optimal solution, , or the minimum value, , because these optimal values may not reflect accurately the true state due to modeling errors. For example, it is explained in the works [6, 7, 3, 8] that it is generally unnecessary to reduce the error measures below the statistical error level that is present in the data.

I-C This Work

The purpose of this work is to examine how these properties reveal themselves in the networked case when a multitude of interconnected agents cooperate to minimize an aggregate cost function that is not generally smooth. In this case, it is necessary to examine closely the effect of the coupled dynamics and whether agents will still be able to agree fast enough under non-differentiability.

Distributed learning under non-smooth risk functions is common in many applications including distributed estimation and distributed machine learning. For example, ℓ1-regularization or hinge-loss functions (as in SVM implementations) lead to non-smooth risks. Several useful techniques have been developed in the literature for the solution of such distributed optimization problems, including the use of consensus strategies [9, 10, 11] and diffusion strategies [12, 13, 14, 15]. In this paper, we will focus on the Adapt-then-Combine (ATC) diffusion strategy mainly because diffusion strategies have been shown to have superior mean-square-error and stability performance in adaptive scenarios where agents are expected to continually learn from streaming data [15]. In particular, we shall examine the performance and stability behavior of networked diffusion learning under weaker conditions than previously considered in the literature. It is true that there have been several useful studies that employed sub-gradient constructions in the distributed setting before, most notably [16, 9, 17]. However, these earlier works generally assume bounded subgradients. As was already explained in Part I [2], this is a serious limitation (the bound does not hold even for quadratic risks, where the gradient vector is linear in its argument and grows unbounded). Instead, we shall consider the weaker affine Lipschitz condition (9), which was shown in Part I [2] to be satisfied automatically by important risk functions such as those arising in popular quadratic, SVM, and LASSO formulations.

Notation: We use lowercase letters to denote vectors, uppercase letters for matrices, plain letters for deterministic variables, and boldface letters for random variables. We also use

to denote transposition, for matrix inversion, for the trace of a matrix,

for the eigenvalues of a matrix,

for the 2-norm of a matrix or the Euclidean norm of a vector, and for the spectral radius of a matrix. Besides, we use to denote that is positive semi-definite, and to denote that all entries of vector are positive.

II Problem Formulation: Multi-Agent Case

We now extend the single agent scenario analysis to multi-agent networks where a collection of agents cooperate with each other to seek the minimizer of a weighted aggregate cost of the form:

(15)

where refers to the agent index and is some positive weighting coefficient added for generality. When the are uniform and equal to each other, then (15) amounts to minimizing the aggregate sum of the individual risks . We can assume, without loss in generality, that the weights are normalized to add up to one

(16)

Each individual risk function continues to be expressed as the expected value of some loss function:

(17)

Here, the letter represents the random data at agent and the expectation is over the distribution of this data. Many problems in adaptation and learning involve risk functions of this form, including, for example, mean-square-error designs and support vector machine (SVM) solutions; see, e.g., [18, 19, 20]. We again allow each risk function to be non-differentiable. This situation is common in machine learning formulations, e.g., in SVM costs and in regularized sparsity-inducing formulations.

We continue to assume that the individual costs satisfy Assumptions 1 and 2 described in the introduction section, namely, conditions (8), (9), and (10), which ensure that each is strongly-convex and its sub-gradient vectors are affine-Lipschitz with parameters ; we are attaching a subscript to these parameters to make them agent-dependent (alternatively, if desired, we can replace them by agent-independent parameters by using bounds on their values).

II-A Network Model

We consider a network consisting of separate agents connected by a topology. As described in [21, 12], we assign a pair of nonnegative weights, , to the edge connecting any two agents and . The scalar is used by agent to scale the data it receives from agent and similarly for . The network is said to be connected if paths with nonzero scaling weights can be found linking any two distinct agents in both directions. The network is said to be strongly–connected if it is connected with at least one self-loop, meaning that for some agent . Figure 1 shows one example of a strongly–connected network. For emphasis in this figure, each edge between two neighboring agents is represented by two directed arrows. The neighborhood of any agent is denoted by and it consists of all agents that are connected to by edges; we assume by default that this set includes agent regardless of whether agent has a self-loop or not.

Fig. 1: Agents that are linked by edges can share information. The neighborhood of agent is marked by the broken line and consists of the set .

There are several strategies that the agents can employ to seek the minimizer, , including consensus and diffusion strategies [11, 9, 10, 21, 12]. As noted earlier, in this work, we focus on the latter class since diffusion implementations have been shown to have superior stability and performance properties over consensus strategies when used in the context of adaptation and learning from streaming data (i.e., when the step-sizes are set to a constant value as opposed to a diminishing value) [21, 12, 15]. Although diminishing step-sizes annihilate the gradient noise term, they nevertheless disable adaptation and learning in the long run. In comparison, constant step-size updates keep adaptation alive, but they allow gradient noise to seep into the operation of the algorithm. The challenge in these scenarios is therefore to show that the dynamics of the diffusion strategy over the network is such that the gradient noise effect does not degrade performance and that the network will be able to learn the unknown. This question has been answered before in the affirmative for smooth twice-differentiable functions; see [21, 12, 13, 14]. In this work, we want to pursue the analysis more generally for possibly non-differentiable risks in order to encompass important applications (such as SVM learning by multi-agents or LASSO and sparsity-aware learning by similar agents [22, 23, 24, 25]). We also want to pursue the analysis under the affine-Lipschitz assumption (9) on the sub-gradients, which is weaker than the conditions used in the prior literature, as we already explained in the earlier sections and in Part I [2].

II-B Diffusion Strategy

We consider the following diffusion strategy in its adapt-then-combine (ATC) form:

(18)

Here, the first step involves adaptation by agent by using a stochastic sub-gradient iteration, while the second step involves aggregation; we assume the gradient noise processes across all agents are independent of each other. The entries

define a left-stochastic matrix, namely, the entries of

are non-negative and each of its columns adds up to one. Since the network is strongly-connected, the combination matrix will be primitive [26, 21]. This implies that will admit a Jordan-decomposition of the form:

(19)

with a single eigenvalue at one and all other eigenvalues strictly inside the unit circle. The matrix has a Jordan structure with the ones that would typically appear along its first sub-diagonal replaced by a small positive number,

. Note that the eigenvectors of

corresponding to the eigenvalue at one are denoted by

(20)

where refers to a column vector with all its entries equal to one. It is further known from the Perron-Frobenius theorem [26] that the entries of are all strictly positive; we normalize them to add up to one. We denote the individual entries of by so that:

(21)

Furthermore, since , it holds that

(22)
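The properties of the combination matrix described above (non-negative entries, columns adding up to one, and a Perron eigenvector with strictly positive entries normalized to sum to one) can be checked numerically on a small example. The sketch below is only an illustration under assumptions: the 5-agent topology, the averaging rule used to build the matrix, and the function names are hypothetical choices, and the Perron vector is obtained by simple power iteration rather than by a Jordan decomposition.

```python
def averaging_rule(neighbors):
    """Left-stochastic matrix with a_{lk} = 1/n_k for l in the neighborhood of k."""
    n = len(neighbors)
    A = [[0.0] * n for _ in range(n)]
    for k, nbrs in enumerate(neighbors):
        for l in nbrs:
            A[l][k] = 1.0 / len(nbrs)
    return A


def perron_vector(A, num_iter=2000):
    """Power iteration p <- A p, normalized so that the entries add up to one."""
    n = len(A)
    p = [1.0 / n] * n
    for _ in range(num_iter):
        p = [sum(A[l][k] * p[k] for k in range(n)) for l in range(n)]
    total = sum(p)
    return [x / total for x in p]


# Hypothetical topology; every neighborhood includes the agent itself.
neighbors = [[0, 1], [0, 1, 2, 3], [1, 2, 3], [1, 2, 3, 4], [3, 4]]
A = averaging_rule(neighbors)
p = perron_vector(A)

column_sums = [sum(A[l][k] for l in range(len(A))) for k in range(len(A))]
print(column_sums)
print(p)
```

For the averaging rule on an undirected graph, the Perron entries come out proportional to the neighborhood sizes, so better-connected agents receive larger entries.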

Next, we introduce the vector

(23)

where is the weight associated with in (15). Since the designer is free to select the step-size parameters, it turns out that we can always relate the vectors in the following manner:

(24)

for some constant . Note, for instance, that for (24) to be valid the scalar should satisfy for all . To make this expression for independent of , we may parameterize (select) the step-sizes as

(25)

for some small . Then, , which is independent of and relation (24) is satisfied. Using (16) and (24) it is easy to check that

(26)

Note that since the are positive, smaller than one, and their sum is one, the above expression shows that can be interpreted as a weighted average step-size parameter.
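The step-size parameterization can be sanity-checked with a few lines of arithmetic. The sketch below assumes that (25) selects each step-size as the ratio of the aggregate weight to the Perron entry, scaled by a common small constant, and that the constant in (24) then equals the reciprocal of that common constant; both readings are assumptions, since the equations are not visible in this text. The Perron vector `p`, the uniform weights `q`, and the value of `mu_o` are illustrative.

```python
# Assumed Perron entries (positive, adding up to one) for a 5-agent network.
p = [2 / 15, 4 / 15, 3 / 15, 4 / 15, 2 / 15]
# Assumed aggregate-cost weights: uniform, adding up to one.
q = [0.2] * 5
mu_o = 0.01  # common small step-size scale (illustrative value)

# Assumed reading of (25): mu_k = (q_k / p_k) * mu_o.
mu = [mu_o * qk / pk for qk, pk in zip(q, p)]

# The constant relating q to diag(mu) p should not depend on the agent index.
zetas = [qk / (mk * pk) for qk, mk, pk in zip(q, mu, p)]

# Weighted average of the step-sizes using the Perron entries as weights.
weighted_avg = sum(pk * mk for pk, mk in zip(p, mu))

print(mu)
print(zetas)
print(weighted_avg)
```

Under these assumptions, agents with small Perron entries are given larger step-sizes, and the Perron-weighted average of the step-sizes recovers the common scale.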

III Network Performance

We are now ready to extend Theorem 1 from Part I[2] to the network case. The analysis is more challenging due to the coupling among the agents. But the result will establish that the distributed strategy is stable and converges exponentially fast for sufficiently small step-sizes. As was the case with Part I [2], the statement below is again in terms of pocket variables, which we define as follows.

At every iteration , the risk value that is attained by iterate is . This value is a random variable due to the randomness in the streaming data used to run the algorithm. We denote the mean risk value at agent by . We again introduce a best pocket iterate, denoted by . At any iteration , the value that is saved in this pocket variable is the iterate that has generated the smallest mean risk value up to time , i.e.,

(27)

Observe that in the network case we now have pocket values, one for each agent.

Theorem 1 (Network performance)

Consider using the stochastic sub-gradient diffusion algorithm (18) to seek the unique minimizer, , of the optimization problem (15), where the risk functions, , are assumed to satisfy assumptions (8), (10), and (12) with parameters . Assume the step-size parameter is sufficiently small (see condition (111)). Then, it holds that

The convergence of towards a neighborhood of size around occurs at an exponential rate, , dictated by the parameter

(29)

Condition (111) further ahead ensures .

Proof: The argument is provided in Appendix A.

The above theorem clarifies the performance of the network in terms of the best pocket values across the agents. However, these pocket values are not readily available because the risk values, , cannot be evaluated. This is due to the fact that the statistical properties of the data are not known beforehand. As was the case with the single-agent scenario in Part I [2], a more practical conclusion can be deduced from the statement of the theorem as follows. We again introduce the geometric sum:

(30)

as well as the normalized and convex-combination coefficients:

(31)

Using these coefficients, we define a weighted iterate at each agent:

(32)

and observe that satisfies the recursive construction:

(33)

In particular, as , we have , and the above recursion simplifies in the limit to

(34)
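The equivalence between the explicit weighted iterate and its recursive construction can be verified numerically. The sketch below assumes that the geometric sum in (30) is S_i = Σ_{j=0}^{i} κ^{i−j}, that the coefficients in (31) are κ^{i−j}/S_i, and that the recursion updates S_i = κS_{i−1} + 1 and then forms a convex combination of the previous weighted iterate and the new iterate. These forms are assumptions consistent with the description in the text, and `kappa` is a hypothetical name for the smoothing factor.

```python
import random


def explicit_weighted_iterate(w_hist, kappa):
    """Direct evaluation of sum_j kappa^(i-j) w_j / S_i at the last index i."""
    i = len(w_hist) - 1
    weights = [kappa ** (i - j) for j in range(i + 1)]
    S = sum(weights)
    return sum(r * w for r, w in zip(weights, w_hist)) / S, S


def recursive_weighted_iterates(w_hist, kappa):
    """Same quantity computed recursively, one iterate at a time."""
    S, w_bar, out = 0.0, 0.0, []
    for w in w_hist:
        S = kappa * S + 1.0
        w_bar = (1.0 - 1.0 / S) * w_bar + w / S
        out.append(w_bar)
    return out, S


rng = random.Random(0)
kappa = 0.95
w_hist = [rng.gauss(0.0, 1.0) for _ in range(400)]  # stand-in iterates

direct, S_direct = explicit_weighted_iterate(w_hist, kappa)
recursive, S_rec = recursive_weighted_iterates(w_hist, kappa)

# The two constructions agree, and S_i approaches 1/(1 - kappa) for large i,
# in which case the weight on the new iterate approaches (1 - kappa).
print(abs(direct - recursive[-1]), abs(S_rec - 1.0 / (1.0 - kappa)))
```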

Corollary 1 (Weighted iterates)

Under the same conditions as in Theorem 1, it holds that

(35)

and convergence continues to occur at the same exponential rate, .

The argument is provided in Appendix D.

Result (35) is an interesting conclusion. However, the statement is in terms of the averaged iterate whose computation requires knowledge of . This latter parameter is a global information, which is not readily available to all agents. Nevertheless, result (35) motivates the following useful distributed implementation with a similar guaranteed performance bound. We can replace by a design parameter, , that is no less than but still smaller than one, i.e., . Next, we introduce the weighted variable:

(36)

where now

(37)

and

(38)

Corollary 2 (Distributed Weighted iterates)

Under the same conditions as in Theorem 1 and , relation (35) continues to hold with in (32) replaced by (36). Moreover, convergence now occurs at the exponential rate .

The argument is similar to the proof of Corollary 2 from Part I [2].

For ease of reference, we summarize in the table below the listing of the stochastic subgradient learning algorithm with exponential smoothing for which Corollaries 1 and 2 hold.
 

Diffusion stochastic subgradient with exponential smoothing

  Initialization:
repeat for :
      for each agent :

(39)
(40)
(41)
(42)

       end
end
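The listing above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four updates (adaptation, combination, geometric sum, smoothed iterate) are written in an assumed form, the 5-agent topology and averaging rule are hypothetical, and each agent uses a scalar toy loss δ|w| + ½(w − x)² with Gaussian data whose mean differs across agents. The step-sizes are scaled inversely to the Perron entries so that the network targets the uniformly weighted aggregate risk, whose minimizer is 0.5 for the values chosen here.

```python
import random


def sign(x):
    return (x > 0) - (x < 0)


# Hypothetical topology; every neighborhood includes the agent itself.
neighbors = [[0, 1], [0, 1, 2, 3], [1, 2, 3], [1, 2, 3, 4], [3, 4]]
N = len(neighbors)

# Averaging rule: a_{lk} = 1/n_k for l in the neighborhood of k.
A = [[0.0] * N for _ in range(N)]
for k, nbrs in enumerate(neighbors):
    for l in nbrs:
        A[l][k] = 1.0 / len(nbrs)


def diffusion_subgradient(mu, means, delta=0.5, sigma=0.3, kappa=0.99,
                          num_iter=4000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * N       # agent iterates
    S = [0.0] * N       # geometric sums
    w_bar = [0.0] * N   # smoothed iterates
    for _ in range(num_iter):
        # Adaptation: each agent takes a stochastic sub-gradient step.
        psi = []
        for k in range(N):
            x = rng.gauss(means[k], sigma)
            g_hat = delta * sign(w[k]) + (w[k] - x)
            psi.append(w[k] - mu[k] * g_hat)
        # Combination: each agent averages the intermediate iterates
        # received from its neighbors.
        w = [sum(A[l][k] * psi[l] for l in neighbors[k]) for k in range(N)]
        # Exponential smoothing of the local iterates.
        for k in range(N):
            S[k] = kappa * S[k] + 1.0
            w_bar[k] = (1.0 - 1.0 / S[k]) * w_bar[k] + w[k] / S[k]
    return w, w_bar


mu_o = 0.01
# Perron entries for this rule are n_k/15, so uniform weights q_k = 1/5
# give mu_k = (q_k/p_k) * mu_o = 3*mu_o/n_k (assumed parameterization).
mu = [3.0 * mu_o / len(nbrs) for nbrs in neighbors]
means = [0.8, 0.9, 1.0, 1.1, 1.2]

w_final, w_smooth = diffusion_subgradient(mu, means)
print([round(v, 3) for v in w_smooth])
```

Although each agent's individual risk has a different minimizer, the smoothed iterates at all agents end up close to one another and close to the minimizer of the aggregate risk.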
 

III-A Interpretation of Results

Examining the bound in (35), and comparing it with result (88) from Part I [2] for the single-agent case, we observe that the topology of the network is now reflected in the bound through the weighting factor, and step-size , which can be related to the Perron entry through (25). Recall from (20) that the are the entries of the right-eigenvector of corresponding to the eigenvalue at one. Moreover, the bound in (35) involves three terms (rather than only two as in the single-agent case; compare with (88) from Part I [2]):

  1. , which arises from the non-smoothness of the risk function;

  2. , which is due to gradient noise and the approximation of the true sub-gradient vector;

  3. , which is an extra term in comparison to the single agent case. We explained in (93) that the value of is related to how far the error at each agent is away from the weighted average error across the network. As for , this quantity represents the disagreement among the agents over . Because each function may have a different minimizer, is generally nonzero.

IV Simulations

Example 1 (Multi-agent LASSO problem) We now consider the LASSO problem with 20 agents connected according to Fig. 2. A quick review of the LASSO problem is as follows. (A more detailed discussion and the relationship between the proposed assumptions (8)–(10) and the LASSO formulation can be found in Part I [2].) We consider the following cost function for each agent:

(43)

where is a regularization parameter and denotes the ℓ1-norm of . The variable plays the role of a desired signal for agent , while plays the role of a regression vector for the same agent. It is assumed that the regression data are zero-mean and wide-sense stationary, and that they follow the standard Gaussian distribution, i.e.,

. We further assume that satisfy a linear model of the form generated through:

(44)

where and is some sparse random model for each agent. Each agent is allowed to have different regression and noise powers, as illustrated in Fig. 3. Under these modeling assumptions, we can determine a closed-form expression for as follows:

(45)

From first-order optimality conditions, we obtain[27]:

(46)

where the symbol represents the soft-thresholding function with parameter , i.e.,

(47)

and

(48)

where the notation , for a scalar , refers to the sign function:

(49)
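For readers unfamiliar with soft-thresholding, the sketch below implements the standard entrywise operator, which shrinks each entry toward zero by the threshold and sets small entries to zero. This is the textbook definition and is assumed, not confirmed, to match (47)–(49); the scalar cost used for the brute-force check and the names `soft_threshold` and `scalar_cost` are illustrative.

```python
def soft_threshold(x, tau):
    """Entrywise soft-thresholding: sign(x_i) * max(|x_i| - tau, 0)."""
    out = []
    for xi in x:
        if xi > tau:
            out.append(xi - tau)
        elif xi < -tau:
            out.append(xi + tau)
        else:
            out.append(0.0)
    return out


def scalar_cost(w, a, tau):
    """Scalar cost 0.5*(w - a)^2 + tau*|w|, minimized by soft-thresholding a."""
    return 0.5 * (w - a) ** 2 + tau * abs(w)


a, tau = 1.3, 0.5
w_star = soft_threshold([a], tau)[0]

# Brute-force check on a fine grid over [-3, 3].
grid = [i / 1000.0 for i in range(-3000, 3001)]
w_grid = min(grid, key=lambda w: scalar_cost(w, a, tau))

print(w_star, w_grid)
print(soft_threshold([3.0, -0.2, -2.0, 0.5], 0.5))
```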

For the stochastic sub-gradient implementation, the following instantaneous approximation for the sub-gradient is employed:

(50)
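An instantaneous sub-gradient of this type can be written down and checked against the sub-gradient inequality. The sketch below assumes a per-sample loss of the form (γ − hᵀw)² + δ‖w‖₁, which may differ from (43) by a scaling factor, and uses sgn(0) = 0 for the non-smooth term; the function names and parameter values are illustrative only.

```python
import random


def sign(x):
    return (x > 0) - (x < 0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lasso_loss(w, h, gamma, delta):
    """Assumed per-sample loss: (gamma - h^T w)^2 + delta * ||w||_1."""
    err = gamma - dot(h, w)
    return err * err + delta * sum(abs(wi) for wi in w)


def lasso_subgrad(w, h, gamma, delta):
    """Instantaneous sub-gradient of the assumed loss at w."""
    err = gamma - dot(h, w)
    return [-2.0 * err * hi + delta * sign(wi) for hi, wi in zip(h, w)]


rng = random.Random(2)
M, delta = 4, 0.2
worst_gap = float("inf")
for _ in range(1000):
    h = [rng.gauss(0.0, 1.0) for _ in range(M)]
    w = [rng.uniform(-2.0, 2.0) for _ in range(M)]
    v = [rng.uniform(-2.0, 2.0) for _ in range(M)]
    gamma = rng.gauss(0.0, 1.0)
    g = lasso_subgrad(w, h, gamma, delta)
    diff = [vi - wi for vi, wi in zip(v, w)]
    # Sub-gradient inequality: Q(v) >= Q(w) + g^T (v - w) for all v.
    gap = lasso_loss(v, h, gamma, delta) - lasso_loss(w, h, gamma, delta) - dot(g, diff)
    worst_gap = min(worst_gap, gap)

print(worst_gap)
```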

In Fig. 4, we compare the performance of this solution against several strategies including standard diffusion LMS[28, 21, 12]:

(51)

and sparse diffusion LMS [22, 24, 25], [23, Eq. 21].
 

Diffusion sparse LMS with exponential smoothing

  Initialization:
repeat for :
      for each agent :

(52)
(53)
(54)
(55)

       end
end
 

The parameter setting is as follows:

has 5 random non-zero entries uniformly distributed between 0.5 and 1.5, and

. We simply let and set the step-size for all agents at . From the simulations we find for the factor that appears in (LABEL:gj8923.d). As for the exponential smoothing factor , we chose .

Fig. 2: Network topology linking agents.
Fig. 3: Feature and noise variances across the agents.

Fig. 4: The excess-risk curves for several strategies.

Example 2 (Multi-agent SVM learning) Next, we consider the multi-agent SVM problem. As with the LASSO problem, we provide a brief review of the notation. A more detailed discussion can be found in Part I [2]. The regularized SVM risk function for each agent is of the form:

(56)

where is a regularization parameter. We are generally given a collection of independent training data, , consisting of feature vectors and their class designations. We select and

(57)

One approximation for the sub-gradient construction at a generic location corresponding to generic data is

(58)

where the indicator function is defined as follows:

(59)
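A sub-gradient construction with an indicator function of this kind can be sketched for the regularized hinge loss. The code below assumes the per-sample loss (ρ/2)‖w‖² + max(0, 1 − γhᵀw) with labels γ ∈ {−1, +1}, and takes the indicator to be one when γhᵀw ≤ 1; these are the usual SVM conventions and are assumed, not confirmed, to coincide with (56)–(59). The check verifies the sub-gradient inequality on random points.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def svm_loss(w, h, gamma, rho):
    """Assumed per-sample loss: (rho/2)*||w||^2 + max(0, 1 - gamma*h^T w)."""
    return 0.5 * rho * dot(w, w) + max(0.0, 1.0 - gamma * dot(h, w))


def svm_subgrad(w, h, gamma, rho):
    """Sub-gradient with indicator I[gamma * h^T w <= 1] on the hinge term."""
    active = 1.0 if gamma * dot(h, w) <= 1.0 else 0.0
    return [rho * wi - active * gamma * hi for wi, hi in zip(w, h)]


rng = random.Random(3)
M, rho = 4, 0.1
worst_gap = float("inf")
for _ in range(1000):
    h = [rng.gauss(0.0, 1.0) for _ in range(M)]
    w = [rng.uniform(-2.0, 2.0) for _ in range(M)]
    v = [rng.uniform(-2.0, 2.0) for _ in range(M)]
    gamma = rng.choice([-1.0, 1.0])
    g = svm_subgrad(w, h, gamma, rho)
    diff = [vi - wi for vi, wi in zip(v, w)]
    # Sub-gradient inequality: Q(v) >= Q(w) + g^T (v - w) for all v.
    gap = svm_loss(v, h, gamma, rho) - svm_loss(w, h, gamma, rho) - dot(g, diff)
    worst_gap = min(worst_gap, gap)

print(worst_gap)
```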

 

Diffusion SVM with exponential smoothing

  Initialization:
repeat for :
      for each agent :

(60)
(61)
(62)
(63)

       end
end
 

We distribute 32561 training data points from the Adult dataset (https://archive.ics.uci.edu/ml/datasets/Adult) over a network consisting of 20 agents. We set and for all agents. From Example 6 in Part I [2] and Theorem 1, we know that for the multi-agent SVM problem:

(64)

We set , which usually guarantees . Fig. 5 (left) shows that cooperation among the agents outperforms the non-cooperative solution. Moreover, the distributed network can almost match the performance of the centralized LIBSVM solution [29]. We also examined the RCV1 dataset (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). Here we have 20242 training data points and we distribute them over 20 agents. We set the parameters to and (due to limited data). We now use since is not that small. The result is shown in Fig. 5 (right).

Fig. 5: Performance of diffusion SVM for the Adult dataset (Top) and RCV1 dataset (Bottom), where the vertical axis measures the percentage of correct predictions over the test dataset.

V Conclusion

In summary, we examined the performance of stochastic sub-gradient learning strategies over adaptive networks. We proposed a new affine-Lipschitz condition, which is quite suitable for strongly convex but non-differentiable cost functions and is automatically satisfied by several important cases including SVM, LASSO, and Total-Variation denoising. Under this weaker condition, the analysis establishes that sub-gradient strategies can attain exponential convergence rates, as opposed to sub-linear rates. The analysis also establishes that these strategies can approach the optimal solution to within O(μ), for sufficiently small step-sizes.

Appendix A Proof of Theorem 1

Introduce the error vector, . We collect the iterates and the respective errors from across the network into block column vectors:

(65)
(66)

We also define the extended quantities:

(67)
(68)
(69)
(70)
(71)

where denotes the Kronecker product operation, and denotes the gradient noise at agent . Using this notation, it is straightforward to verify that the network error vector generated by the diffusion strategy (18) evolves according to the following dynamics:

(72)

Motivated by the treatment of the smooth case in [21, 13, 14], we introduce a useful change of variables. Let and . Multiplying (72) from the left by gives

(73)

where from (19):

(74)

and

(75)

To proceed, we introduce

(77)
(78)

where the quantities amount to the weighted averages:

(79)
(80)
(81)

It is useful to observe the asymmetry reflected in the fact that is obtained by using the weights while the averages (77)–(78) are obtained by using the weights . We can now rewrite (73) as