Distributed stochastic optimization with gradient tracking over strongly-connected networks

03/18/2019
by   Ran Xin, et al.
Tufts University
Carnegie Mellon University

In this paper, we study distributed stochastic optimization to minimize a sum of smooth and strongly-convex local cost functions over a network of agents, communicating over a strongly-connected graph. Assuming that each agent has access to a stochastic first-order oracle (SFO), we propose a novel distributed method, called S-AB, where each agent uses an auxiliary variable to asymptotically track the gradient of the global cost in expectation. The S-AB algorithm employs row- and column-stochastic weights simultaneously to ensure both consensus and optimality. Since doubly-stochastic weights are not used, S-AB is applicable to arbitrary strongly-connected graphs. We show that under a sufficiently small constant step-size, S-AB converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer. We present numerical simulations based on real-world data sets to illustrate the theoretical results.


I Introduction

In the era of data deluge, where it is particularly difficult to store and process all data on a single device/node/processor, distributed schemes are becoming attractive for inference, learning, and optimization. Distributed optimization over multi-agent systems, thus, has been of significant interest in many areas including, but not limited to, machine learning [1, 2], big-data analytics [3, 4], and distributed control [5, 6]. However, the underlying algorithms must be designed to address practical limitations and realistic scenarios. For instance, with computation and data collection/storage being pushed to edge devices, e.g., in the Internet of Things (IoT), the data available for distributed optimization is often inexact. Moreover, the ad hoc nature of setups outside of data centers requires the algorithms to be amenable to communication protocols that are not necessarily bidirectional. The focus of this paper is to study and characterize distributed optimization schemes where the inter-agent communication is restricted to directed graphs and the information/data is inexact.

In particular, we study distributed stochastic optimization over directed graphs and propose the S-AB algorithm to minimize a sum of local cost functions. The S-AB algorithm assumes access to a stochastic first-order oracle (SFO), i.e., when an agent queries the SFO, it gets an unbiased estimate of the gradient of its local cost function. In the proposed approach, each agent takes a weighted average of its own and its neighbors' solution estimates, and simultaneously incorporates its local estimate of the gradient of the global cost function. The exchange of solution estimates is performed over a row-stochastic weight matrix. In parallel, each agent maintains its own estimate of the gradient of the global cost function by combining a weighted average of its own and its neighbors' gradient estimates with its local gradient tracking update. The exchange of gradient estimates is performed over a column-stochastic weight matrix. Since doubly-stochastic weights are nowhere used, S-AB is an attractive solution that is applicable to arbitrary, strongly-connected graphs.

The main contributions of this paper are as follows: (i) We show that, by choosing a sufficiently small constant step-size, $\alpha$, S-AB converges linearly to a neighborhood of the global minimizer. This convergence guarantee is achieved for continuously-differentiable, strongly-convex local cost functions, where each agent is assumed to have access to an SFO whose gradient noise has zero mean and bounded variance. (ii) We provide explicit expressions for the appropriate norms under which the row- and column-stochastic weight matrices contract. With the help of these norms, we develop sharp and explicit convergence arguments.

We now briefly review the literature on distributed and stochastic optimization. Early work on deterministic finite-sum problems includes [7, 8, 9], while work on stochastic problems can be found in [10, 11]. Recently, gradient tracking has been proposed, where the local gradient at each agent is replaced by an estimate of the global gradient [12, 13, 14, 15]. Methods for directed graphs that are based on gradient tracking [16, 17, 18, 19, 14, 15, 20, 21] rely on separate iterations for eigenvector estimation that may impede convergence. This issue was recently resolved in [22, 23], see also [24, 25, 26, 27, 28] for follow-up work, where eigenvector estimation was removed with the help of a unique approach that uses both row- and column-stochastic weights. Ref. [22] derives linear convergence for the finite-sum problem when the underlying functions are smooth and strongly-convex; however, since arbitrary norms are used in the analysis, the convergence bounds are not sharp. Recent related work on time-varying networks and other approaches can be found in [29, 30, 31, 32, 33], albeit without gradient tracking. Of significant relevance is [34], where a similar setup with gradient tracking is considered over undirected graphs. We note that S-AB generalizes [34], and that the analysis in [34] relies on weight-matrix contraction in the 2-norm, which is not applicable here.

We now describe the rest of the paper. Section II describes the problem, assumptions, and some auxiliary results. We present the convergence analysis in Section III and the main result in Section IV. Finally, Section V provides the numerical experiments and Section VI concludes the paper.

Basic Notation:

We use lowercase bold letters for vectors and uppercase italic letters for matrices. We use $I_n$ for the $n \times n$ identity matrix and $\mathbf{1}_n$ for the column vector of $n$ ones. For an arbitrary vector, $\mathbf{x}$, we denote its $i$th element by $x_i$, and its smallest and largest elements by $\min(\mathbf{x})$ and $\max(\mathbf{x})$, respectively. Inequalities involving matrices and vectors are to be interpreted componentwise. For a matrix, $X$, we denote its spectral radius by $\rho(X)$ and its infinite power (if it exists) by $X_\infty$, i.e., $X_\infty = \lim_{k \to \infty} X^k$. For a primitive, row-stochastic matrix, $A$, we denote its left and right eigenvectors corresponding to the eigenvalue $1$ by $\boldsymbol{\pi}_A$ and $\mathbf{1}_n$, respectively, such that $\boldsymbol{\pi}_A^\top \mathbf{1}_n = 1$ and $A_\infty = \mathbf{1}_n \boldsymbol{\pi}_A^\top$. Similarly, for a primitive, column-stochastic matrix, $B$, with right and left eigenvectors $\boldsymbol{\pi}_B$ and $\mathbf{1}_n$ such that $\mathbf{1}_n^\top \boldsymbol{\pi}_B = 1$, we have $B_\infty = \boldsymbol{\pi}_B \mathbf{1}_n^\top$.

II Problem Formulation and Auxiliary Results

Consider $n$ agents connected over a directed graph, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, \ldots, n\}$ is the set of agents, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the collection of ordered pairs, $(i, j)$, such that agent $j$ can send information to agent $i$. We assume that $(i, i) \in \mathcal{E}, \forall i \in \mathcal{V}$. The agents solve the following problem:

P1: $\quad \min_{\mathbf{x} \in \mathbb{R}^p} F(\mathbf{x}) := \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}),$  (1)

where each $f_i: \mathbb{R}^p \rightarrow \mathbb{R}$ is known only to agent $i$. We now formalize the assumptions.

Assumption 1

Each local objective, $f_i$, is $\mu$-strongly-convex, i.e., $\forall i \in \mathcal{V}$ and $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$, we have

$f_i(\mathbf{y}) \geq f_i(\mathbf{x}) + \nabla f_i(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2} \|\mathbf{y} - \mathbf{x}\|_2^2.$

Under Assumption 1, the optimal solution of Problem P1 exists and is unique, which we denote as $\mathbf{x}^*$.

Assumption 2

Each local objective, $f_i$, is $L$-smooth, i.e., its gradient is Lipschitz-continuous: $\forall i \in \mathcal{V}$ and $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$, we have, for some $L > 0$,

$\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\|_2 \leq L \|\mathbf{x} - \mathbf{y}\|_2.$

We make the following assumption on the agent communication graph, which guarantees the existence of a directed path from each agent to every other agent.

Assumption 3

The graph, $\mathcal{G}$, is strongly-connected.

We consider distributed iterative algorithms to solve Problem P1, where each agent is able to call a stochastic first-order oracle (SFO). At iteration $k$, when agent $i$ queries the SFO with $\mathbf{x}_{i,k}$ as the input, it returns a stochastic gradient in the form of $\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k})$, where the $\boldsymbol{\xi}_{i,k}$ are random vectors, $\forall i \in \mathcal{V}, k \geq 0$. The stochastic gradients, $\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k})$, satisfy the following standard assumptions:

Assumption 4

The set of random vectors $\{\boldsymbol{\xi}_{i,k}\}_{i \in \mathcal{V}, k \geq 0}$ are independent of each other, and

  1. $\mathbb{E}\left[\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k}) \mid \mathbf{x}_{i,k}\right] = \nabla f_i(\mathbf{x}_{i,k})$,

  2. $\mathbb{E}\left[\left\|\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k}) - \nabla f_i(\mathbf{x}_{i,k})\right\|_2^2 \mid \mathbf{x}_{i,k}\right] \leq \sigma^2$.

Assumption 4 is satisfied in many scenarios, for example, when the gradient noise, $\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k}) - \nabla f_i(\mathbf{x}_{i,k})$, is independent and identically distributed (i.i.d.) with zero mean and finite second moment, while being independent of $\mathbf{x}_{i,k}$. However, Assumption 4 also allows for general gradient noise processes that depend on the agent $i$ and the current iterate $\mathbf{x}_{i,k}$. Finally, we denote by $\mathcal{F}_k$ the $\sigma$-algebra generated by the random vectors $\{\boldsymbol{\xi}_{i,s}\}_{i \in \mathcal{V},\, 0 \leq s \leq k-1}$.

II-A The S-AB Algorithm

We now describe the proposed algorithm, S-AB, to solve Problem P1. Each agent $i$ maintains two state vectors, $\mathbf{x}_{i,k}$ and $\mathbf{y}_{i,k}$, both in $\mathbb{R}^p$, where $k$ is the iteration index. The variable $\mathbf{x}_{i,k}$ is the estimate of the global minimizer, $\mathbf{x}^*$, while $\mathbf{y}_{i,k}$ is the global gradient estimator. The S-AB algorithm, initialized with arbitrary $\mathbf{x}_{i,0}$'s and with $\mathbf{y}_{i,0} = \mathbf{g}_i(\mathbf{x}_{i,0}, \boldsymbol{\xi}_{i,0})$, is given by the following:

$\mathbf{x}_{i,k+1} = \sum_{j=1}^{n} a_{ij}\, \mathbf{x}_{j,k} - \alpha\, \mathbf{y}_{i,k},$  (2a)
$\mathbf{y}_{i,k+1} = \sum_{j=1}^{n} b_{ij}\, \mathbf{y}_{j,k} + \mathbf{g}_i(\mathbf{x}_{i,k+1}, \boldsymbol{\xi}_{i,k+1}) - \mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k}),$  (2b)

where the weight matrices $A = \{a_{ij}\}$ and $B = \{b_{ij}\}$ are row- and column-stochastic, respectively, and follow the graph topology, i.e., $a_{ij} > 0$ and $b_{ij} > 0$ iff $(i, j) \in \mathcal{E}$. We next write the algorithm in a compact vector form for the sake of analysis:

$\mathbf{x}_{k+1} = A\, \mathbf{x}_k - \alpha\, \mathbf{y}_k,$  (3a)
$\mathbf{y}_{k+1} = B\, \mathbf{y}_k + \mathbf{g}(\mathbf{x}_{k+1}, \boldsymbol{\xi}_{k+1}) - \mathbf{g}(\mathbf{x}_k, \boldsymbol{\xi}_k),$  (3b)

where we stack the local variables, i.e., $\mathbf{x}_k := [\mathbf{x}_{1,k}; \ldots; \mathbf{x}_{n,k}]$, $\mathbf{y}_k := [\mathbf{y}_{1,k}; \ldots; \mathbf{y}_{n,k}]$, and $\mathbf{g}(\mathbf{x}_k, \boldsymbol{\xi}_k) := [\mathbf{g}_1(\mathbf{x}_{1,k}, \boldsymbol{\xi}_{1,k}); \ldots; \mathbf{g}_n(\mathbf{x}_{n,k}, \boldsymbol{\xi}_{n,k})]$.

Note that when the variance, $\sigma^2$, of the stochastic gradients is $0$, we recover the $\mathcal{AB}$ (or push-pull) algorithm proposed in [22, 23]. In the following, we assume $p = 1$ for the sake of simplicity. The analysis can be extended to the general case of $p > 1$ with the help of Kronecker products.
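The updates (2a)-(2b) translate directly into code. Below is a minimal numerical sketch of S-AB; the uniform weight construction and the `LocalSFO` oracle interface from the earlier sketch are illustrative assumptions, with `adj[i, j] = 1` encoding that agent `j` can send to agent `i` (self-loops included).

```python
import numpy as np

def uniform_weights(adj):
    """Row-stochastic A and column-stochastic B from a binary adjacency
    matrix with self-loops; both inherit the sparsity pattern of adj."""
    A = adj / adj.sum(axis=1, keepdims=True)  # each row sums to 1
    B = adj / adj.sum(axis=0, keepdims=True)  # each column sums to 1
    return A, B

def s_ab(oracles, adj, alpha, num_iters, p):
    """Run the S-AB updates (2a)-(2b) with a constant step-size alpha."""
    n = len(oracles)
    A, B = uniform_weights(adj.astype(float))
    x = np.zeros((n, p))
    g_old = np.array([o.grad(x[i]) for i, o in enumerate(oracles)])
    y = g_old.copy()                          # y_{i,0} = g_i(x_{i,0})
    for _ in range(num_iters):
        x = A @ x - alpha * y                 # (2a): consensus + descent
        g_new = np.array([o.grad(x[i]) for i, o in enumerate(oracles)])
        y = B @ y + g_new - g_old             # (2b): gradient tracking
        g_old = g_new
    return x
```

Note that each agent can implement its row of $A$ locally by weighting the messages it receives, while column-stochasticity of $B$ only requires each agent to scale its outgoing messages by its out-degree; no doubly-stochastic weight design is needed.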

II-B Auxiliary Results

We now provide some auxiliary results to aid the convergence analysis of S-AB. We first develop explicit norms with respect to which the weight matrices, $A$ and $B$, contract. Since both $A$ and $B$ are primitive and stochastic, we use their (non-$\mathbf{1}_n$) Perron vectors, $\boldsymbol{\pi}_A$ and $\boldsymbol{\pi}_B$, respectively, to define two weighted inner products as follows: $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,

$\langle \mathbf{x}, \mathbf{y} \rangle_A := \sum_{i=1}^{n} [\boldsymbol{\pi}_A]_i\, x_i y_i, \qquad \langle \mathbf{x}, \mathbf{y} \rangle_B := \sum_{i=1}^{n} [\boldsymbol{\pi}_B]_i^{-1}\, x_i y_i.$

The above inner products are well-defined because the Perron vectors, $\boldsymbol{\pi}_A$ and $\boldsymbol{\pi}_B$, are positive, and they respectively induce weighted Euclidean norms as follows: $\forall \mathbf{x} \in \mathbb{R}^n$,

$\|\mathbf{x}\|_A := \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_A}, \qquad \|\mathbf{x}\|_B := \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_B}.$

We denote by $\|\cdot\|_A$ and $\|\cdot\|_B$ also the matrix norms induced by these vector norms, i.e., see [35],

$\|X\|_A := \max_{\|\mathbf{x}\|_A = 1} \|X \mathbf{x}\|_A,$  (4)
$\|X\|_B := \max_{\|\mathbf{x}\|_B = 1} \|X \mathbf{x}\|_B.$  (5)

It can be verified that $\|\cdot\|_2$, $\|\cdot\|_A$, and $\|\cdot\|_B$ are equivalent, with equivalence constants determined by the smallest and largest entries of $\boldsymbol{\pi}_A$ and $\boldsymbol{\pi}_B$.

We next establish the contraction of the $A$ and $B$ matrices with respect to the above norms.

Lemma 1

For the matrices $A$ and $B$, and $\forall \mathbf{x} \in \mathbb{R}^n$, we have:

$\|A\mathbf{x} - A_\infty \mathbf{x}\|_A \leq \sigma_A\, \|\mathbf{x} - A_\infty \mathbf{x}\|_A,$  (6)
$\|B\mathbf{x} - B_\infty \mathbf{x}\|_B \leq \sigma_B\, \|\mathbf{x} - B_\infty \mathbf{x}\|_B,$  (7)

with $\sigma_A < 1$ and $\sigma_B < 1$.

The proof of Lemma 1 is available in the Appendix. It can be further verified that $\sigma_A$ and $\sigma_B$ are the second-largest singular values of the scaled matrices $\mathrm{diag}(\boldsymbol{\pi}_A)^{\frac{1}{2}} A\, \mathrm{diag}(\boldsymbol{\pi}_A)^{-\frac{1}{2}}$ and $\mathrm{diag}(\boldsymbol{\pi}_B)^{-\frac{1}{2}} B\, \mathrm{diag}(\boldsymbol{\pi}_B)^{\frac{1}{2}}$, respectively, where $\sigma_2(\cdot)$ denotes the second-largest singular value of a matrix.
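The contraction in Lemma 1 can be checked numerically. The sketch below assumes the weighted-norm definitions given above, estimates $\boldsymbol{\pi}_A$ by power iteration, and computes $\sigma_A$ as the induced norm of $A - A_\infty$; the random digraph with self-loops is assumed to be strongly-connected.

```python
import numpy as np

def perron_left(A, tol=1e-12):
    """Left Perron vector of a primitive row-stochastic A (power
    iteration), normalized so that its entries sum to 1."""
    pi = np.ones(A.shape[0]) / A.shape[0]
    while True:
        nxt = pi @ A
        if np.linalg.norm(nxt - pi, 1) < tol:
            return nxt
        pi = nxt

rng = np.random.default_rng(0)
n = 8
adj = (rng.random((n, n)) < 0.4) | np.eye(n, dtype=bool)  # assumed strongly-connected
A = adj / adj.sum(axis=1, keepdims=True)                  # row-stochastic

pi_A = perron_left(A)
A_inf = np.outer(np.ones(n), pi_A)                        # A_inf = 1 pi_A^T
D_half = np.diag(np.sqrt(pi_A))
# sigma_A: the norm of A - A_inf induced by ||.||_A (a 2-norm after scaling).
sigma_A = np.linalg.norm(D_half @ (A - A_inf) @ np.linalg.inv(D_half), 2)

x = rng.standard_normal(n)
lhs = np.sqrt(pi_A @ (A @ x - A_inf @ x) ** 2)            # ||Ax - A_inf x||_A
rhs = sigma_A * np.sqrt(pi_A @ (x - A_inf @ x) ** 2)
print(sigma_A < 1, lhs <= rhs + 1e-12)                    # True True
```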

In the following, Lemma 2 provides some simple results on the stochastic gradients, Lemma 3 uses the $L$-smoothness of the cost functions, while Lemmas 4 and 5 are standard results in convex optimization and matrix analysis. To present these results, we define three quantities related to the stochastic gradients and their network averages. The following statements use standard arguments and their formal proofs are omitted due to space limitations; similar results can be found in [13, 22, 34].

Lemma 2

Consider S-AB in (2) and let Assumptions 2-4 hold. Then the following hold, $\forall k \geq 0$:

Lemma 3

Consider the S-AB algorithm in (2) and let Assumption 2 hold. Then the following holds, $\forall k \geq 0$:

Lemma 4 ([36])

Let Assumptions 1-2 hold. If $0 < \alpha < 1/L$, we have, $\forall \mathbf{x} \in \mathbb{R}^p$:

$\|\mathbf{x} - \alpha \nabla F(\mathbf{x}) - \mathbf{x}^*\|_2 \leq (1 - \mu\alpha)\, \|\mathbf{x} - \mathbf{x}^*\|_2.$

Lemma 5 ([35])

Let $M \in \mathbb{R}^{n \times n}$ be non-negative and $\mathbf{v} \in \mathbb{R}^n$ be a positive vector. If $M\mathbf{v} < \lambda \mathbf{v}$ with $\lambda > 0$, then $\rho(M) < \lambda$.
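Lemma 5 is easy to sanity-check numerically; the snippet below is a small illustration with randomly drawn $M$ and $\mathbf{v}$, not a proof.

```python
import numpy as np

# Lemma 5: M >= 0 (entrywise) and v > 0 with M v < lam * v imply rho(M) < lam.
rng = np.random.default_rng(1)
M = rng.random((4, 4))                 # non-negative matrix
v = rng.random(4) + 0.1                # positive vector
lam = 1.01 * max(M @ v / v)            # chosen so M v < lam v holds entrywise
rho = max(abs(np.linalg.eigvals(M)))   # spectral radius
print(rho < lam)                       # True
```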

III Convergence Analysis

In this section, we analyze the S-AB algorithm and establish its convergence properties, for which we present Lemmas 6-9. The proofs of these lemmas are provided in the Appendix. First, in Lemma 6, we bound the second moment, $\mathbb{E}[\|\mathbf{y}_k\|^2]$, of the global gradient estimator.

Lemma 6

Let Assumptions 1-4 hold. Then the iterates generated by S-AB in (3) follow:

(8)

Next, in Lemmas 7-9, we bound the following three quantities in expectation, conditioned on the $\sigma$-algebra $\mathcal{F}_k$: (i) the consensus error in the network; (ii) the optimality gap; and (iii) the gradient tracking error. We then show that the norm of a vector composed of these three quantities converges linearly to a ball around the optimal solution when the step-size, $\alpha$, is fixed and sufficiently small. The first lemma below concerns the consensus error.

Lemma 7

Let Assumption 3 hold. Then the consensus error in the network follows:

(9)

The next lemma is on the optimality gap.

Lemma 8

Let Assumptions 1-4 hold. If $0 < \alpha < 1/L$, the optimality gap in the network follows:

(10)

Finally, we quantify the gradient tracking error.

Lemma 9

Let the hypotheses of Lemma 2 hold. The gradient tracking error follows:

(11)

With the help of the above lemmas, we define a vector, $\mathbf{u}_k \in \mathbb{R}^3$, that stacks the consensus error, the optimality gap, and the gradient tracking error. By substituting the bound from Lemma 6 into Lemmas 7-9, and taking the full expectation on both sides, it can be verified that $\mathbf{u}_k$ follows the dynamical system below:

$\mathbf{u}_{k+1} \leq J_\alpha\, \mathbf{u}_k + \mathbf{s},$  (12)

where $J_\alpha \in \mathbb{R}^{3 \times 3}$ is a non-negative matrix whose entries depend on the step-size, $\alpha$, and the vector $\mathbf{s}$ collects the constants due to the gradient-noise variance; the explicit expressions follow from Lemmas 6-9.
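Since the explicit entries of $J_\alpha$ are not reproduced above, the snippet below illustrates the mechanics of the subsequent analysis on a generic stand-in with the same structure: contraction factors $\sigma_A$, $\sigma_B$, and $(1 - \mu\alpha)$ on the diagonal, plus $O(\alpha)$ coupling terms. All numeric values are assumptions for illustration only; the true entries are those derived from Lemmas 6-9.

```python
import numpy as np

def J_alpha(alpha, sig_a=0.9, sig_b=0.8, mu=1.0, c=5.0):
    # Generic stand-in for the matrix in (12): diagonal contraction
    # factors plus O(alpha) coupling. Entries are illustrative only.
    return np.array([
        [sig_a,     alpha * c,       alpha * c],
        [alpha * c, 1 - mu * alpha,  alpha * c],
        [alpha * c, alpha * c,       sig_b + alpha * c],
    ])

# Lemma 5 certificate: a positive delta with J_alpha @ delta < delta
# proves rho(J_alpha) < 1; delta_2 is enlarged to absorb the
# (1 - mu * alpha) row, mirroring the choice of delta in (16) below.
delta = np.array([1.0, 11.0, 1.0])
for alpha in [1e-1, 1e-2, 1e-3]:
    J = J_alpha(alpha)
    certified = np.all(J @ delta < delta)
    print(alpha, certified, max(abs(np.linalg.eigvals(J))))
```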

IV Main Result

In this section, we analyze the dynamical system in (12) to establish the convergence of S-AB.

Theorem 1

Consider the S-AB algorithm in (2) and let Assumptions 1-4 hold. Suppose the step-size $\alpha$ satisfies $0 < \alpha < 1/L$ together with the conditions (17)-(18) derived below. Then, $\forall k \geq 0$, the vector $\mathbf{u}_k$ has non-negative components and converges linearly to a neighborhood of zero whose size is controlled by the step-size and the gradient-noise variance, where the above convergence is geometric with exponent $\rho(J_\alpha) < 1$.

The goal is to find the range of $\alpha$ such that $\rho(J_\alpha) < 1$. In light of Lemma 5, it suffices to solve for the range of $\alpha$ such that $J_\alpha \boldsymbol{\delta} < \boldsymbol{\delta}$ holds for some positive vector $\boldsymbol{\delta} = [\delta_1, \delta_2, \delta_3]^\top$. We now expand this element-wise matrix inequality, which can be reformulated as follows:

(13)
(14)
(15)

We now set $\boldsymbol{\delta}$ as

(16)

Then, for (13) to hold, it suffices to require

(17)

One can verify that, with the choice of $\boldsymbol{\delta}$ provided in (16), (14) holds. Lastly, for (15) to hold, we have

(18)

Therefore, (17) and (18), together with the requirement that $0 < \alpha < 1/L$ from Lemma 8, complete the proof. It is important to note that the error bounds in Theorem 1 go to zero as the step-size gets smaller and as the variance of the gradient noise decreases.

V Numerical Experiments

In this section, we illustrate the S-AB algorithm and its convergence properties. We demonstrate the results on a directed graph generated using nearest-neighbor rules; the particular graph for the experiments is shown in Fig. 1 (left) to provide a sense of the connectivity. We choose a logistic regression problem to classify images of two digits, labeled as $+1$ or $-1$, from the MNIST dataset [37]. Each image is vectorized, and the training images are divided among the agents such that each agent holds an equal share as its local batch. Because of privacy and communication restrictions, the agents do not share their local batches (local training images) with each other. In order to use the entire dataset for training, the network of agents cooperatively solves the following distributed logistic regression problem:

$\min_{\mathbf{w}} F(\mathbf{w}) = \sum_{i=1}^{n} f_i(\mathbf{w}),$

where the private function at each agent, $f_i$, is given by:

$f_i(\mathbf{w}) = \sum_{j=1}^{m_i} \ln\left(1 + \exp\left(-b_{ij}\, \mathbf{w}^\top \mathbf{a}_{ij}\right)\right),$

where $\mathbf{a}_{ij}$ and $b_{ij} \in \{-1, +1\}$ denote the $j$th training image and label held by agent $i$, and $m_i$ is the size of the local batch.
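For completeness, a minimal sketch of the per-sample stochastic gradient that each agent's SFO would return for the logistic loss above, written for labels in $\{-1, +1\}$; the function name and the use of scipy's expit are illustrative choices.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def logistic_sample_grad(w, a, b):
    """Gradient of ln(1 + exp(-b * w^T a)) at a single sample (a, b),
    with label b in {-1, +1}: equals -b * sigmoid(-b * w^T a) * a."""
    return -b * expit(-b * (a @ w)) * a
```

This plugs directly into the oracle interface sketched in Section II.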

We compare centralized and distributed methods on this classification problem. Centralized gradient descent (CGD) uses the entire batch, i.e., it computes all gradients at each iteration, whereas centralized stochastic gradient descent (C-SGD) uses only one data point at each iteration, uniformly sampled from the entire batch. For the distributed algorithms, we show the performance of the non-stochastic $\mathcal{AB}$ algorithm, where each agent uses its entire local batch of labeled data points, whereas for the implementation of S-AB, each agent uniformly samples one data point from its local batch at each iteration. For testing, we use additional images that were not used for training. The residuals are shown in Fig. 1 (right), while the training and testing accuracies are shown in Fig. 2. In the performance figures, the horizontal axis represents the number of epochs, where each epoch represents computations over the entire batch. Clearly, S-AB performs better than $\mathcal{AB}$ [22], as expected from the performance of their centralized counterparts, C-SGD and CGD.

Fig. 1: (Left) Strongly-connected directed graph. (Right) Residuals.
Fig. 2: (Left) Training accuracy. (Right) Test accuracy.

VI Conclusions

In this paper, we have presented a distributed stochastic gradient descent algorithm, S-AB, applicable over arbitrary strongly-connected graphs. In this setup, the data is distributed over the agents, and each agent uniformly samples a data point (from its local batch) at each iteration to implement the stochastic S-AB algorithm. To cope with general directed communication graphs and the potential lack of doubly-stochastic weights, S-AB employs a two-phase update with row- and column-stochastic weights. We have shown that, under a sufficiently small constant step-size, S-AB converges linearly to a neighborhood of the global minimizer when the local cost functions are smooth and strongly-convex. We have presented numerical simulations based on real-world datasets to illustrate the theoretical results.

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
  • [2] H. Raja and W. U. Bajwa, “Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data,” IEEE Trans. Signal Processing, vol. 64, no. 1, pp. 173–188, Jan. 2016.
  • [3] A. Daneshmand, F. Facchinei, V. Kungurtsev, and G. Scutari, “Hybrid random/deterministic parallel algorithms for convex and nonconvex big data optimization,” IEEE Trans. on Signal Processing, vol. 63, no. 15, pp. 3914–3929, 2015.
  • [4] D. Jakovetić, J. Xavier, and J. M. F. Moura, “Fast distributed gradient methods,” IEEE Trans. on Automatic Control, vol. 59, no. 5, pp. 1131–1146, May 2014.
  • [5] F. Bullo, J. Cortés, and S. Martinez, Distributed control of robotic networks: A mathematical approach to motion coordination algorithms, Princeton University Press, 2009.
  • [6] S. Lee and M. M. Zavlanos, “Approximate projection methods for decentralized optimization with functional constraints,” IEEE Trans. on Automatic Control, 2017.
  • [7] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
  • [8] I. Lobel, A. Ozdaglar, and D. Feijer, “Distributed multi-agent optimization with state-dependent communication,” Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
  • [9] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Trans. on Automatic Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.
  • [10] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of optimization theory and applications, vol. 147, no. 3, pp. 516–545, 2010.
  • [11] A. Nedić and A. Olshevsky, “Stochastic gradient-push for strongly convex functions on time-varying directed graphs,” IEEE Trans. on Automatic Control, vol. 61, no. 12, pp. 3936–3947, Dec. 2016.
  • [12] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes,” in IEEE 54th Annual Conference on Decision and Control, 2015, pp. 2055–2060.
  • [13] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. on Control of Network Systems, Apr. 2017.
  • [14] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal of Optimization, vol. 27, no. 4, pp. 2597–2633, Dec. 2017.
  • [15] C. Xi, R. Xin, and U. A. Khan, “ADD-OPT: Accelerated distributed directed optimization,” IEEE Trans. on Automatic Control, Aug. 2017, in press.
  • [16] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Push-sum distributed dual averaging for convex optimization,” in 51st IEEE Annual Conference on Decision and Control, Maui, Hawaii, Dec. 2012, pp. 5453–5458.
  • [17] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Trans. on Automatic Control, vol. 60, no. 3, pp. 601–615, Mar. 2015.
  • [18] C. Xi, Q. Wu, and U. A. Khan, “On the distributed optimization over directed networks,” Neurocomputing, vol. 267, pp. 508–515, Dec. 2017.
  • [19] C. Xi and U. A. Khan, “Distributed subgradient projection algorithm over directed graphs,” IEEE Trans. on Automatic Control, vol. 62, no. 8, pp. 3986–3992, Oct. 2016.
  • [20] C. Xi, V. S. Mai, R. Xin, E. Abed, and U. A. Khan, “Linear convergence in optimization over directed graphs with row-stochastic matrices,” IEEE Trans. on Automatic Control, Jan. 2018, in press.
  • [21] R. Xin, C. Xi, and U. A. Khan, “FROST – Fast row-stochastic optimization with uncoordinated step-sizes,” arXiv:1803.09169, Mar. 2018.
  • [22] R. Xin and U. A. Khan, “A linear algorithm for optimization over directed graphs with geometric convergence,” IEEE Control Systems Letters, vol. 2, no. 3, pp. 325–330, Jul. 2018.
  • [23] S. Pu, W. Shi, J. Xu, and A. Nedić, “A push-pull gradient method for distributed optimization in networks,” in 57th IEEE Annual Conference on Decision and Control, Dec. 2018.
  • [24] R. Xin and U. A. Khan, “Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking,” arXiv preprint arXiv:1808.02942, 2018.
  • [25] S. Pu, W. Shi, J. Xu, and A. Nedić, “Push-pull gradient methods for distributed optimization in networks,” arXiv:1810.06653, 2018.
  • [26] F. Saadatniaki, R. Xin, and U. A. Khan, “Optimization over time-varying directed graphs with row and column-stochastic matrices,” arXiv preprint arXiv:1810.07393, 2018.
  • [27] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” arXiv preprint arXiv:1809.08694, 2018.
  • [28] R. Xin, D. Jakovetić, and U. A. Khan, “Distributed Nesterov gradient methods over arbitrary graphs,” IEEE Signal Processing Letters, Jan. 2019, arXiv:1901.06995.
  • [29] D. Yuan, Y. Hong, D. W. C. Ho, and G. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, Apr. 2018.
  • [30] N. D. Vanli, M. O. Sayin, and S. S. Kozat, “Stochastic subgradient algorithms for strongly convex optimization over distributed networks,” IEEE Trans. on Network Science and Engineering, vol. 4, no. 4, pp. 248–260, Oct. 2017.
  • [31] D. Jakovetić, D. Bajović, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in IEEE Conference on Decision and Control, Dec. 2018, pp. 4238–4245.
  • [32] A. K. Sahu, D. Jakovetić, D. Bajović, and S. Kar, “Distributed zeroth order optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach,” in Conference on Decision and Control, Dec. 2018, pp. 4951–4958.
  • [33] D. Jakovetić, “A unification and generalization of exact distributed first order methods,” IEEE Trans. on Signal and Information Processing over Networks, 2018.
  • [34] S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in 2018 IEEE Conference on Decision and Control (CDC), Dec. 2018, pp. 963–968.
  • [35] R. A. Horn and C. R. Johnson, Matrix Analysis, 2 ed., Cambridge University Press, New York, NY, 2013.
  • [36] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
  • [37] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.

Proof of Lemma 1

We start with the proof of (6). Note that $A A_\infty = A_\infty$, which leads to

$A\mathbf{x} - A_\infty \mathbf{x} = (A - A_\infty)(\mathbf{x} - A_\infty \mathbf{x}), \quad \forall \mathbf{x} \in \mathbb{R}^n.$

By the definition of $\|\cdot\|_A$ and Eq. (4), we have

$\|A - A_\infty\|_A^2 = \lambda_{\max}\left( D^{-\frac{1}{2}} (A - A_\infty)^\top D\, (A - A_\infty)\, D^{-\frac{1}{2}} \right) =: \sigma_A^2,$

where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix and $D := \mathrm{diag}(\boldsymbol{\pi}_A)$. What we need to show is that $\sigma_A < 1$. Expanding $D^{\frac{1}{2}} (A - A_\infty) D^{-\frac{1}{2}}$, and using the facts that $A \mathbf{1}_n = \mathbf{1}_n$, $\boldsymbol{\pi}_A^\top A = \boldsymbol{\pi}_A^\top$, and $A_\infty = \mathbf{1}_n \boldsymbol{\pi}_A^\top$, it can be verified that $\sigma_A$ is the second-largest singular value of $D^{\frac{1}{2}} A D^{-\frac{1}{2}}$, whose largest singular value equals $1$. Since $A$ is primitive, by the Perron-Frobenius theorem [35], we have $\sigma_A < 1$, which completes the proof of (6). The proof of (7) follows similarly, with $A$, $A_\infty$, and $D$ replaced by $B$, $B_\infty$, and $\mathrm{diag}(\boldsymbol{\pi}_B)^{-1}$.