I Introduction
In the era of data deluge, where it is particularly difficult to store and process all data on a single device/node/processor, distributed schemes are becoming attractive for inference, learning, and optimization. Distributed optimization over multi-agent systems, thus, has been of significant interest in many areas including but not limited to machine learning
[1, 2], big-data analytics [3, 4], and distributed control [5, 6]. However, the underlying algorithms must be designed to address practical limitations and realistic scenarios. For instance, with computation and data collection/storage being pushed to the edge devices, e.g., in the Internet of Things (IoT), the data available for distributed optimization is often inexact. Moreover, the ad hoc nature of setups outside of data centers requires the algorithms to be amenable to communication protocols that are not necessarily bidirectional. The focus of this paper is to study and characterize distributed optimization schemes where the inter-agent communication is restricted to directed graphs and the information/data is inexact. In particular, we study distributed stochastic optimization over directed graphs and propose the S-AB algorithm to minimize a sum of local cost functions. The S-AB algorithm assumes access to a stochastic first-order oracle (SFO), i.e., when an agent queries the SFO, it gets an unbiased estimate of the gradient of its local cost function. In the proposed approach, each agent updates its solution estimate with a weighted average of its own and its neighbors' solution estimates, while simultaneously taking a descent step along its local estimate of the gradient of the global cost function. The exchange of solution estimates is performed over a row-stochastic weight matrix. In parallel, each agent maintains its own estimate of the gradient of the global cost function by combining a weighted average of its own and its neighbors' gradient estimates with its latest local stochastic gradient information. The exchange of gradient estimates is performed over a column-stochastic weight matrix. Since doubly-stochastic weights are not required, S-AB is an attractive solution that is applicable to arbitrary, strongly-connected graphs.

The main contributions of this paper are as follows:
(i) We show that, by choosing a sufficiently small constant step-size, $\alpha$, S-AB converges linearly to a neighborhood of the global minimizer. This convergence guarantee is achieved for continuously-differentiable, strongly-convex local cost functions, where each agent is assumed to have access to an SFO whose gradient noise has zero mean and bounded variance.
We now briefly review the literature on distributed and stochastic optimization. Early work on deterministic finite-sum problems includes [7, 8, 9], while work on stochastic problems can be found in [10, 11]. More recently, gradient tracking has been proposed, where the local gradient at each agent is replaced by an estimate of the global gradient [12, 13, 14, 15]. Methods for directed graphs that are based on gradient tracking [16, 17, 18, 19, 14, 15, 20, 21] rely on separate iterations for eigenvector estimation that may impede convergence. This issue was recently resolved in [22, 23], see also [24, 25, 26, 27, 28] for follow-up work, where eigenvector estimation was removed with the help of a unique approach that uses both row- and column-stochastic weights. Ref. [22] derives linear convergence for the finite-sum problem when the underlying functions are smooth and strongly-convex; however, since arbitrary norms are used in the analysis, the convergence bounds are not sharp. Recent related work on time-varying networks and other approaches can be found in [29, 30, 31, 32, 33], albeit without gradient tracking. Of significant relevance is [34], where a similar setup with gradient tracking is considered over undirected graphs. We note that S-AB generalizes [34]; moreover, the analysis in [34] relies on the contraction of a doubly-stochastic weight matrix in the 2-norm, which is not applicable here.

We now describe the rest of the paper. Section II describes the problem, assumptions, and some auxiliary results. We present the convergence analysis in Section III and the main result in Section IV. Finally, Section V provides the numerical experiments and Section VI concludes the paper.
Basic Notation:
We use lowercase bold letters for vectors and uppercase italic letters for matrices. We use $I_n$ for the $n \times n$ identity matrix and $\mathbf{1}_n$ for the column vector of ones. For an arbitrary vector, $\mathbf{x}$, we denote its $i$th element by $x_i$, its smallest element by $\underline{x}$, and its largest element by $\overline{x}$. Inequalities involving matrices and vectors are to be interpreted componentwise. For a matrix, $X$, we denote its spectral radius by $\rho(X)$ and its infinite power (if it exists) by $X^\infty$, i.e., $X^\infty = \lim_{k\rightarrow\infty} X^k$. For a primitive, row-stochastic matrix, $A$, we denote its left and right eigenvectors corresponding to the eigenvalue $1$ by $\pi_r$ and $\mathbf{1}_n$, respectively, such that $\pi_r^\top \mathbf{1}_n = 1$ and $A^\infty = \mathbf{1}_n \pi_r^\top$. Similarly, for a primitive, column-stochastic matrix, $B$, we denote the corresponding eigenvectors by $\mathbf{1}_n$ and $\pi_c$, respectively, such that $\mathbf{1}_n^\top \pi_c = 1$ and $B^\infty = \pi_c \mathbf{1}_n^\top$.

II Problem Formulation and Auxiliary Results
Consider $n$ agents connected over a directed graph, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, \ldots, n\}$ is the set of agents, and $\mathcal{E}$ is the collection of ordered pairs, $(i, r)$, $i, r \in \mathcal{V}$, such that agent $r$ can send information to agent $i$. We assume that $(i, i) \in \mathcal{E}$, $\forall i \in \mathcal{V}$. The agents solve the following problem:

P1: $\min_{\mathbf{x} \in \mathbb{R}^p} F(\mathbf{x}) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}),$ (1)

where each $f_i: \mathbb{R}^p \rightarrow \mathbb{R}$ is known only to agent $i$. We now formalize the assumptions.
Assumption 1
Each local objective, $f_i$, is $\mu$-strongly-convex, i.e., $\forall i \in \mathcal{V}$ and $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$, we have, for some $\mu > 0$,

$f_i(\mathbf{y}) \geq f_i(\mathbf{x}) + \nabla f_i(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2} \left\|\mathbf{y} - \mathbf{x}\right\|_2^2.$
Under Assumption 1, the optimal solution of Problem P1 exists and is unique; we denote it by $\mathbf{x}^*$.
Assumption 2
Each local objective, $f_i$, is $l$-smooth, i.e., its gradient is Lipschitz-continuous: $\forall i \in \mathcal{V}$ and $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^p$, we have, for some $l > 0$,

$\left\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\right\|_2 \leq l \left\|\mathbf{x} - \mathbf{y}\right\|_2.$
We make the following assumption on the agent communication graph, which guarantees the existence of a directed path from each agent to every other agent.
Assumption 3
The graph, $\mathcal{G}$, is strongly-connected.
We consider distributed iterative algorithms to solve Problem P1, in which each agent $i$ is able to call a stochastic first-order oracle (SFO). At iteration $k$, given $\mathbf{x}_{i,k} \in \mathbb{R}^p$ as the input, the SFO at agent $i$ returns a stochastic gradient in the form of $\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k})$, where the $\boldsymbol{\xi}_{i,k}$'s are random vectors, $\forall i \in \mathcal{V}$, $k \geq 0$. The stochastic gradients, $\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k})$, satisfy the following standard assumptions:
Assumption 4
The set of random vectors $\{\boldsymbol{\xi}_{i,k}\}_{i \in \mathcal{V}, k \geq 0}$ are independent of each other, and, $\forall i \in \mathcal{V}$, $k \geq 0$,
- $\mathbb{E}\left[\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k}) \mid \mathbf{x}_{i,k}\right] = \nabla f_i(\mathbf{x}_{i,k})$,
- $\mathbb{E}\left[\left\|\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k}) - \nabla f_i(\mathbf{x}_{i,k})\right\|_2^2 \mid \mathbf{x}_{i,k}\right] \leq \sigma^2$, for some $\sigma > 0$.
Assumption 4 is satisfied in many scenarios, for example, when the gradient noise, $\mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k}) - \nabla f_i(\mathbf{x}_{i,k})$, is independent and identically distributed (i.i.d.) with zero mean and finite second moment, while being independent of $\mathbf{x}_{i,k}$. However, Assumption 4 also allows for general gradient noise processes that depend on the agent, $i$, and the current iterate, $\mathbf{x}_{i,k}$. Finally, we denote by $\mathcal{F}_k$ the $\sigma$-algebra generated by the set of random vectors $\{\boldsymbol{\xi}_{i,t}\}_{i \in \mathcal{V},\, 0 \leq t \leq k}$.
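As a concrete illustration of the SFO abstraction, the following is a minimal sketch (the helper name `make_sfo` and the least-squares stand-in data are illustrative assumptions, not from the paper): uniformly sampling one local data point yields an unbiased estimate of the local gradient, consistent with Assumption 4, and bounded data yields a bounded variance.

```python
# A minimal sketch of the stochastic first-order oracle (SFO) abstraction.
# Agent i's local cost is assumed here to be an average of least-squares
# terms over its local data; all names and data below are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def make_sfo(features, labels):
    """SFO for the local cost f_i(x) = (1/m) sum_j (a_j' x - b_j)^2 / 2."""
    m = len(labels)
    def sfo(x):
        j = rng.integers(m)              # uniform sample from the local batch
        a, b = features[j], labels[j]
        return (a @ x - b) * a           # stochastic gradient g_i(x, xi)
    return sfo

# Empirical check of unbiasedness: the sample mean over many oracle calls
# approaches the exact local gradient, by the law of large numbers.
A_i = rng.normal(size=(50, 3)); b_i = rng.normal(size=50)
sfo = make_sfo(A_i, b_i)
x = rng.normal(size=3)
exact = A_i.T @ (A_i @ x - b_i) / len(b_i)
approx = np.mean([sfo(x) for _ in range(20000)], axis=0)
print(np.linalg.norm(approx - exact))    # close to zero
```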
II-A The S-AB Algorithm

We now describe the proposed algorithm, S-AB, to solve Problem P1. Each agent $i$ maintains two state vectors, $\mathbf{x}_{i,k}$ and $\mathbf{y}_{i,k}$, both in $\mathbb{R}^p$, where $k$ is the iteration index. The variable $\mathbf{x}_{i,k}$ is agent $i$'s estimate of the global minimizer, $\mathbf{x}^*$, while $\mathbf{y}_{i,k}$ is its estimate of the gradient of the global cost function. The S-AB algorithm, initialized with arbitrary $\mathbf{x}_{i,0}$'s and with $\mathbf{y}_{i,0} = \mathbf{g}_i(\mathbf{x}_{i,0}, \boldsymbol{\xi}_{i,0})$, $\forall i \in \mathcal{V}$, is given by the following: $\forall k \geq 0$,
$\mathbf{x}_{i,k+1} = \sum_{r=1}^{n} a_{ir}\, \mathbf{x}_{r,k} - \alpha\, \mathbf{y}_{i,k},$ (2a)

$\mathbf{y}_{i,k+1} = \sum_{r=1}^{n} b_{ir}\, \mathbf{y}_{r,k} + \mathbf{g}_i(\mathbf{x}_{i,k+1}, \boldsymbol{\xi}_{i,k+1}) - \mathbf{g}_i(\mathbf{x}_{i,k}, \boldsymbol{\xi}_{i,k}),$ (2b)
where $\alpha > 0$ is a constant step-size, and the weight matrices $A = \{a_{ir}\}$ and $B = \{b_{ir}\}$ are row- and column-stochastic, respectively, and follow the graph topology, i.e., $a_{ir} > 0$ and $b_{ir} > 0$ if and only if $(i, r) \in \mathcal{E}$. We next write the algorithm in a compact vector form for the sake of analysis:
$\mathbf{x}_{k+1} = \mathcal{A}\, \mathbf{x}_k - \alpha\, \mathbf{y}_k,$ (3a)

$\mathbf{y}_{k+1} = \mathcal{B}\, \mathbf{y}_k + \mathbf{g}(\mathbf{x}_{k+1}, \boldsymbol{\xi}_{k+1}) - \mathbf{g}(\mathbf{x}_k, \boldsymbol{\xi}_k),$ (3b)
where we use the following notation: $\mathcal{A} \triangleq A \otimes I_p$, $\mathcal{B} \triangleq B \otimes I_p$,

$\mathbf{x}_k \triangleq \left[\mathbf{x}_{1,k}^\top, \ldots, \mathbf{x}_{n,k}^\top\right]^\top$, $\mathbf{y}_k \triangleq \left[\mathbf{y}_{1,k}^\top, \ldots, \mathbf{y}_{n,k}^\top\right]^\top$, and $\mathbf{g}(\mathbf{x}_k, \boldsymbol{\xi}_k) \triangleq \left[\mathbf{g}_1(\mathbf{x}_{1,k}, \boldsymbol{\xi}_{1,k})^\top, \ldots, \mathbf{g}_n(\mathbf{x}_{n,k}, \boldsymbol{\xi}_{n,k})^\top\right]^\top$.
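For intuition, the following is a minimal NumPy sketch of the updates (2a)-(2b) for $p = 1$ with simple quadratic local costs; the 3-agent directed cycle, the weights, and all constants are illustrative choices, not from the paper.

```python
# A minimal sketch of the S-AB updates (2a)-(2b), assuming p = 1 and quadratic
# local costs f_i(x) = (x - c_i)^2 / 2, so that the exact local gradient is
# x - c_i and the SFO adds zero-mean Gaussian noise.
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma = 0.05, 0.1
c = np.array([1.0, 2.0, 6.0])            # global minimizer: x* = mean(c) = 3.0

# Row-stochastic A and column-stochastic B sharing the support of a
# strongly-connected digraph (directed cycle 1 -> 2 -> 3 -> 1 + self-loops).
A = np.array([[0.7, 0.0, 0.3],
              [0.4, 0.6, 0.0],
              [0.0, 0.2, 0.8]])           # rows sum to 1
B = np.array([[0.6, 0.0, 0.3],
              [0.4, 0.5, 0.0],
              [0.0, 0.5, 0.7]])           # columns sum to 1

def sfo(x):                               # noisy local gradients (Assumption 4)
    return (x - c) + sigma * rng.normal(size=3)

x = rng.normal(size=3)
g_old = sfo(x)
y = g_old.copy()                          # y_{i,0} = g_i(x_{i,0}, xi_{i,0})
for k in range(2000):
    x = A @ x - alpha * y                 # (2a): consensus step + descent
    g_new = sfo(x)
    y = B @ y + g_new - g_old             # (2b): stochastic gradient tracking
    g_old = g_new

print(x)                                  # entries hover near x* = 3.0
```

Note that no doubly-stochastic weights appear: $A$ and $B$ only need to share the support of the directed graph.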
II-B Auxiliary Results
We now provide some auxiliary results to aid the convergence analysis of S-AB. We first develop explicit norms that capture the contractions of the weight matrices, $A$ and $B$. Since both $A$ and $B$ are primitive and stochastic, we use their non-$\mathbf{1}_n$ Perron vectors, $\pi_r$ and $\pi_c$, respectively, to define two weighted inner products as follows: $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,

$\langle \mathbf{x}, \mathbf{y} \rangle_A \triangleq \mathbf{x}^\top \mathrm{diag}(\pi_r)\, \mathbf{y}, \qquad \langle \mathbf{x}, \mathbf{y} \rangle_B \triangleq \mathbf{x}^\top \mathrm{diag}(\pi_c)^{-1}\, \mathbf{y}.$

The above inner products are well-defined because the Perron vectors, $\pi_r$ and $\pi_c$, are positive, and they respectively induce the weighted Euclidean norms: $\forall \mathbf{x} \in \mathbb{R}^n$,

$\|\mathbf{x}\|_A \triangleq \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_A}, \qquad \|\mathbf{x}\|_B \triangleq \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_B}.$
We also denote by $\|\cdot\|_A$ and $\|\cdot\|_B$ the matrix norms induced by the respective vector norms, i.e., see [35]: $\forall X \in \mathbb{R}^{n \times n}$,

$\|X\|_A \triangleq \max_{\|\mathbf{x}\|_A = 1} \left\|X\mathbf{x}\right\|_A,$ (4)

$\|X\|_B \triangleq \max_{\|\mathbf{x}\|_B = 1} \left\|X\mathbf{x}\right\|_B.$ (5)
It can be verified that the corresponding norm equivalence relationships between $\|\cdot\|_2$, $\|\cdot\|_A$, and $\|\cdot\|_B$ are given by: $\forall \mathbf{x} \in \mathbb{R}^n$,

$\sqrt{\underline{\pi_r}}\, \|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_A \leq \sqrt{\overline{\pi_r}}\, \|\mathbf{x}\|_2, \qquad \frac{1}{\sqrt{\overline{\pi_c}}}\, \|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_B \leq \frac{1}{\sqrt{\underline{\pi_c}}}\, \|\mathbf{x}\|_2.$
We next establish the contraction of the matrices $A$ and $B$ with the help of the above arguments.
Lemma 1
For the matrices $A$, $A^\infty$, $B$, and $B^\infty$ defined above, we have: $\forall \mathbf{x} \in \mathbb{R}^n$,

$\left\|A\mathbf{x} - A^\infty \mathbf{x}\right\|_A \leq \sigma_A \left\|\mathbf{x} - A^\infty \mathbf{x}\right\|_A,$ (6)

$\left\|B\mathbf{x} - B^\infty \mathbf{x}\right\|_B \leq \sigma_B \left\|\mathbf{x} - B^\infty \mathbf{x}\right\|_B,$ (7)

with $\sigma_A < 1$ and $\sigma_B < 1$.
The proof of Lemma 1 is available in the Appendix. It can be further verified that

$\sigma_A = \left\|A - A^\infty\right\|_A = \sigma_2\!\left(\mathrm{diag}\left(\sqrt{\pi_r}\right) A\, \mathrm{diag}\left(\sqrt{\pi_r}\right)^{-1}\right), \qquad \sigma_B = \left\|B - B^\infty\right\|_B = \sigma_2\!\left(\mathrm{diag}\left(\sqrt{\pi_c}\right)^{-1} B\, \mathrm{diag}\left(\sqrt{\pi_c}\right)\right),$

where $\sigma_2(\cdot)$ is the second largest singular value of a matrix.
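The following snippet numerically checks Lemma 1 on the illustrative row-stochastic matrix from the earlier sketch (the matrix itself is an arbitrary example): it computes the Perron vector $\pi_r$, forms $A^\infty = \mathbf{1}_n \pi_r^\top$, and verifies that $\sigma_A = \|A - A^\infty\|_A < 1$.

```python
# A numerical check of Lemma 1 (sketch): sigma_A = |||A - A^inf|||_A < 1,
# computed via the weighted 2-norm, i.e., ||D X D^{-1}||_2 with D = diag(sqrt(pi_r)).
import numpy as np

A = np.array([[0.7, 0.0, 0.3],
              [0.4, 0.6, 0.0],
              [0.0, 0.2, 0.8]])
n = A.shape[0]

# Left Perron vector: pi_r' A = pi_r', normalized so that pi_r' 1 = 1.
w, V = np.linalg.eig(A.T)
pi_r = np.real(V[:, np.argmax(np.real(w))])
pi_r /= pi_r.sum()

A_inf = np.outer(np.ones(n), pi_r)        # A^infty = 1 * pi_r'
D = np.diag(np.sqrt(pi_r))                # change of basis realizing ||.||_A
sigma_A = np.linalg.norm(D @ (A - A_inf) @ np.linalg.inv(D), 2)
print(sigma_A)                            # strictly less than 1
assert sigma_A < 1.0
```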
In the following, Lemma 2 provides some simple consequences of the assumptions on the stochastic gradients, Lemma 3 uses the $l$-smoothness of the cost functions, while Lemmas 4 and 5 are standard results in convex optimization and matrix analysis. To present these results, we define three auxiliary quantities: the stacked gradient, $\nabla \mathbf{f}(\mathbf{x}_k) \triangleq \left[\nabla f_1(\mathbf{x}_{1,k})^\top, \ldots, \nabla f_n(\mathbf{x}_{n,k})^\top\right]^\top$, and the appropriately weighted network averages of the iterates, $\mathbf{x}_k$, and of the gradient estimates, $\mathbf{y}_k$. The following statements use standard arguments and their formal proofs are omitted due to space limitations; similar results can be found in [13, 22, 34].
Lemma 5 ([35])
Let $X \in \mathbb{R}^{n \times n}$ be non-negative and $\mathbf{x} \in \mathbb{R}^n$ be a positive vector. If $X\mathbf{x} < \omega\, \mathbf{x}$ for some $\omega > 0$, then $\rho(X) < \omega$.
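A quick numerical illustration of Lemma 5 follows (the matrix and vector are arbitrary test data): for a non-negative $X$ and a positive $\mathbf{x}$ with $X\mathbf{x} < \omega\,\mathbf{x}$ componentwise, the spectral radius of $X$ stays below $\omega$.

```python
# A numerical illustration of Lemma 5 with arbitrary test data.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(4, 4))    # non-negative matrix
x = rng.uniform(1.0, 2.0, size=4)         # positive vector
omega = np.max((X @ x) / x) + 1e-9        # smallest omega with X x < omega x
rho = np.max(np.abs(np.linalg.eigvals(X)))
print(rho, omega)
assert rho < omega                        # consistent with Lemma 5
```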
III Convergence Analysis
In this section, we analyze the S-AB algorithm and establish its convergence properties, for which we present Lemmas 6-9; the proofs of these lemmas are provided in the Appendix. First, in Lemma 6, we bound an auxiliary quantity that is used in the subsequent lemmas. Next, in Lemmas 7-9, we bound the following three quantities in expectation, conditioned on the $\sigma$-algebra $\mathcal{F}_k$: (i) the consensus error in the network; (ii) the optimality gap; and (iii) the gradient tracking error. We then show that the norm of a vector composed of these three quantities converges linearly to a ball around the optimal solution when the step-size is fixed and sufficiently small. The first lemma below is on the consensus error.
Lemma 7
Let Assumption 3 hold. Then the consensus error in the network follows:
(9) |
The next lemma is on the optimality gap.
Finally, we quantify the gradient tracking error.
Lemma 9
Let the hypotheses of Lemma 2 hold. The gradient tracking error follows:
(11) |
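Before presenting the main result, the following is a minimal numerical sketch of the argument outlined above: the three error quantities obey a recursion of the form $\mathbf{v}_{k+1} \leq G \mathbf{v}_k + \mathbf{b}$, and whenever $\rho(G) < 1$, the errors converge linearly to a ball of size $\|(I - G)^{-1}\mathbf{b}\|$. The matrix `G` and vector `b` below are arbitrary placeholders, not the actual constants from Lemmas 7-9.

```python
# A sketch of the Section IV argument: if the vector of the three errors
# satisfies v_{k+1} <= G v_k + b with rho(G) < 1, then v_k converges linearly
# to a ball of size ||(I - G)^{-1} b||.
import numpy as np

G = np.array([[0.90, 0.02, 0.03],
              [0.05, 0.80, 0.05],
              [0.10, 0.10, 0.70]])
b = np.full(3, 1e-3)                      # plays the role of the noise terms
assert np.max(np.abs(np.linalg.eigvals(G))) < 1

v = np.ones(3)                            # initial error vector
for k in range(500):
    v = G @ v + b                         # worst-case error recursion

v_star = np.linalg.solve(np.eye(3) - G, b)
print(v, v_star)                          # v approaches (I - G)^{-1} b linearly
```

Shrinking `b` (which plays the role of the step-size and gradient-noise terms in Lemmas 7-9) shrinks the limiting ball, mirroring the remark after Theorem 1.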
IV Main Result
In this section, we analyze the system of error inequalities established in Section III to derive the convergence of S-AB.
Theorem 1

Let Assumptions 1-4 hold. With a sufficiently small constant step-size, $\alpha$, the iterates generated by S-AB converge linearly to a neighborhood of the global minimizer, $\mathbf{x}^*$, whose size depends on $\alpha$ and on the variance, $\sigma^2$, of the gradient noise.

The goal is to find the range of the step-size, $\alpha$, such that the spectral radius of the coefficient matrix in the error recursion of Lemmas 7-9 is strictly less than $1$. In the light of Lemma 5, it suffices to solve for the range of $\alpha$ such that the corresponding element-wise inequality holds for some positive vector $\boldsymbol{\delta} = [\delta_1, \delta_2, \delta_3]^\top$. Expanding this element-wise matrix inequality, we obtain the following conditions:
(13)
(14)
(15)
We now set $\delta_1$, $\delta_2$, and $\delta_3$ as
(16) |
Then for (13) to hold, it suffices to require
(17) |
One can verify that, with the choices of $\delta_1$, $\delta_2$, and $\delta_3$ provided in (16), (14) holds. Lastly, for (15) to hold, we have
(18) |
Therefore, (17) and (18), together with the requirement on the step-size from Lemma 8, complete the proof. It is important to note that the error bounds in Theorem 1 go to zero as the step-size gets smaller and as the variance of the gradient noise decreases.
V Numerical Experiments
In this section, we illustrate the S-AB algorithm and its convergence properties. We demonstrate the results on a directed graph generated using nearest-neighbor rules; the particular graph used in the experiments is shown in Fig. 1 (left) to provide a sense of the connectivity. We choose a logistic regression problem to classify images of two digits from the MNIST dataset [37]. Each image is vectorized, and the total set of training images is divided among the $n$ agents such that each agent holds a local batch of training images. Because of privacy and communication restrictions, the agents do not share their local batches (local training images) with each other. In order to use the entire dataset for training, the network of agents cooperatively solves a distributed logistic regression problem of the form of Problem P1, where the private function at each agent is the regularized logistic loss over its local batch; a sketch of this objective is given below.
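For concreteness, the following sketch spells out one plausible form of the private cost and its single-sample stochastic gradient, assuming the standard regularized logistic loss $f_i(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{j} \ln\left(1 + \exp\left(-y_{ij}\, \mathbf{w}^\top \mathbf{x}_{ij}\right)\right)$; the regularization parameter, the random stand-in data, and the helper name `local_grad` are illustrative assumptions, not the exact setup used in the experiments.

```python
# A sketch of one agent's private logistic-regression cost and its stochastic
# gradient, assuming the standard form
#   f_i(w) = (lam/2) ||w||^2 + sum_j ln(1 + exp(-y_ij * w' x_ij)).
import numpy as np

rng = np.random.default_rng(3)
lam = 1e-3
X_i = rng.normal(size=(20, 784))        # stand-in for agent i's vectorized images
y_i = rng.choice([-1.0, 1.0], size=20)  # stand-in binary labels

def local_grad(w, j):
    """Unbiased stochastic gradient of f_i at w from one sampled data point j."""
    m = len(y_i)
    margin = y_i[j] * (X_i[j] @ w)
    coeff = -y_i[j] / (1.0 + np.exp(margin))   # d/dw of ln(1 + exp(-margin))
    return lam * w + m * coeff * X_i[j]        # factor m keeps the sum unbiased

w = np.zeros(784)
j = rng.integers(len(y_i))              # uniform sample from the local batch
g = local_grad(w, j)                    # one SFO query, as used by S-AB
```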
We compare the performance on this classification problem across centralized and distributed methods. Centralized gradient descent (CGD) uses the entire batch, i.e., it computes a full gradient over all training images at each iteration, whereas centralized stochastic gradient descent (C-SGD) uses only one data point at each iteration, sampled uniformly from the entire batch. For the distributed algorithms, we show the performance of the non-stochastic AB algorithm [22], where each agent uses its entire local batch of labeled data points, and of S-AB, where each agent uniformly samples one data point from its local batch at each iteration. For testing, we use additional images that were not used for training. The residuals are shown in Fig. 1 (right), while the training and testing accuracy are shown in Fig. 2. In the performance figures, the horizontal axis represents the number of epochs, where each epoch represents one pass of computation over the entire batch. Clearly, S-AB has better performance when compared to AB [22], as expected from the relative performance of their centralized counterparts, C-SGD and CGD.
Fig. 1: (Left) The directed graph used in the experiments. (Right) Residuals of the centralized and distributed methods.

Fig. 2: Training and testing accuracy of the centralized and distributed methods.
VI Conclusions
In this paper, we have presented a distributed stochastic gradient descent algorithm, S-AB, applicable to arbitrary strongly-connected directed graphs. In this setup, the data is distributed over the agents, and each agent uniformly samples one data point from its local batch at each iteration of the algorithm. To cope with general directed communication graphs and the potential lack of doubly-stochastic weights, S-AB employs a two-part update based on row- and column-stochastic weights. We have shown that, with a sufficiently small constant step-size, S-AB converges linearly to a neighborhood of the global minimizer when the local cost functions are smooth and strongly-convex. We have presented numerical simulations based on real-world datasets to illustrate the theoretical results.
References
- [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
- [2] H. Raja and W. U. Bajwa, “Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data,” IEEE Trans. Signal Processing, vol. 64, no. 1, pp. 173–188, Jan. 2016.
- [3] A. Daneshmand, F. Facchinei, V. Kungurtsev, and G. Scutari, “Hybrid random/deterministic parallel algorithms for convex and nonconvex big data optimization,” IEEE Trans. on Signal Processing, vol. 63, no. 15, pp. 3914–3929, 2015.
- [4] D. Jakovetić, J. Xavier, and J. M. F. Moura, “Fast distributed gradient methods,” IEEE Trans. on Automatic Control, vol. 59, no. 5, pp. 1131–1146, May 2014.
- [5] F. Bullo, J. Cortés, and S. Martinez, Distributed control of robotic networks: A mathematical approach to motion coordination algorithms, Princeton University Press, 2009.
- [6] S. Lee and M. M. Zavlanos, “Approximate projection methods for decentralized optimization with functional constraints,” IEEE Trans. on Automatic Control, 2017.
- [7] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
- [8] I. Lobel, A. Ozdaglar, and D. Feijer, “Distributed multi-agent optimization with state-dependent communication,” Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
- [9] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Trans. on Automatic Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.
- [10] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of optimization theory and applications, vol. 147, no. 3, pp. 516–545, 2010.
- [11] A. Nedić and A. Olshevsky, “Stochastic gradient-push for strongly convex functions on time-varying directed graphs,” IEEE Trans. on Automatic Control, vol. 61, no. 12, pp. 3936–3947, Dec. 2016.
- [12] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes,” in IEEE 54th Annual Conference on Decision and Control, 2015, pp. 2055–2060.
- [13] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. on Control of Network Systems, Apr. 2017.
- [14] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal of Optimization, vol. 27, no. 4, pp. 2597–2633, Dec. 2017.
- [15] C. Xi, R. Xin, and U. A. Khan, “ADD-OPT: Accelerated distributed directed optimization,” IEEE Trans. on Automatic Control, Aug. 2017, in press.
- [16] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Push-sum distributed dual averaging for convex optimization,” in 51st IEEE Annual Conference on Decision and Control, Maui, Hawaii, Dec. 2012, pp. 5453–5458.
- [17] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Trans. on Automatic Control, vol. 60, no. 3, pp. 601–615, Mar. 2015.
- [18] C. Xi, Q. Wu, and U. A. Khan, “On the distributed optimization over directed networks,” Neurocomputing, vol. 267, pp. 508–515, Dec. 2017.
- [19] C. Xi and U. A. Khan, “Distributed subgradient projection algorithm over directed graphs,” IEEE Trans. on Automatic Control, vol. 62, no. 8, pp. 3986–3992, Oct. 2016.
- [20] C. Xi, V. S. Mai, R. Xin, E. Abed, and U. A. Khan, “Linear convergence in optimization over directed graphs with row-stochastic matrices,” IEEE Trans. on Automatic Control, Jan. 2018, in press.
- [21] R. Xin, C. Xi, and U. A. Khan, “FROST – Fast row-stochastic optimization with uncoordinated step-sizes,” Arxiv: https://arxiv.org/abs/1803.09169, Mar. 2018.
- [22] R. Xin and U. A. Khan, “A linear algorithm for optimization over directed graphs with geometric convergence,” IEEE Control Systems Letters, vol. 2, no. 3, pp. 325–330, Jul. 2018.
- [23] S. Pu, W. Shi, J. Xu, and A. Nedić, “A push-pull gradient method for distributed optimization in networks,” in 57th IEEE Annual Conference on Decision and Control, Dec. 2018.
- [24] R. Xin and U. A. Khan, “Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking,” arXiv preprint arXiv:1808.02942, 2018.
- [25] S. Pu, W. Shi, J. Xu, and A. Nedić, “Push-pull gradient methods for distributed optimization in networks,” https://arxiv.org/abs/1810.06653, 2018.
- [26] F. Saadatniaki, R. Xin, and U. A. Khan, “Optimization over time-varying directed graphs with row and column-stochastic matrices,” arXiv preprint arXiv:1810.07393, 2018.
- [27] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” arXiv preprint arXiv:1809.08694, 2018.
- [28] R. Xin, D. Jakovetic, and U. A. Khan, “Distributed Nesterov gradient methods over arbitrary graphs,” IEEE Signal Processing Letters, Jan. 2019, Arxiv: 1901.06995.
- [29] D. Yuan, Y. Hong, D. W. C. Ho, and G. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, Apr. 2018.
- [30] N. D. Vanli, M. O. Sayin, and S. S. Kozat, “Stochastic subgradient algorithms for strongly convex optimization over distributed networks,” IEEE Trans. on Network Science and Engineering, vol. 4, no. 4, pp. 248–260, Oct. 2017.
- [31] D. Jakovetić, D. Bajović, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in IEEE Conference on Decision and Control, Dec. 2018, pp. 4238–4245.
- [32] A. K. Sahu, D. Jakovetić, D. Bajović, and S. Kar, “Distributed zeroth order optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach,” in Conference on Decision and Control, Dec. 2018, pp. 4951–4958.
- [33] D. Jakovetic, “A unification and generalization of exact distributed first order methods,” IEEE Trans. on Signal and Information Processing over Networks, 2018.
- [34] S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in 2018 IEEE Conference on Decision and Control (CDC), Dec. 2018, pp. 963–968.
- [35] R. A. Horn and C. R. Johnson, Matrix Analysis, 2 ed., Cambridge University Press, New York, NY, 2013.
- [36] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
- [37] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
Proof of Lemma 1
We start with the proof of (6). Note that $A A^\infty = A^\infty A = A^\infty A^\infty = A^\infty$, which leads to: $\forall \mathbf{x} \in \mathbb{R}^n$,

$A\mathbf{x} - A^\infty \mathbf{x} = \left(A - A^\infty\right)\left(\mathbf{x} - A^\infty \mathbf{x}\right),$

and therefore $\left\|A\mathbf{x} - A^\infty\mathbf{x}\right\|_A \leq \left\|A - A^\infty\right\|_A \left\|\mathbf{x} - A^\infty\mathbf{x}\right\|_A$.
By the definition of $\|\cdot\|_A$ and Eq. (4), we have

$\left\|A - A^\infty\right\|_A = \left\|\widetilde{A}\right\|_2 = \sqrt{\lambda_{\max}\left(\widetilde{A}^\top \widetilde{A}\right)},$

where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix and $\widetilde{A} \triangleq \mathrm{diag}\left(\sqrt{\pi_r}\right) \left(A - A^\infty\right) \mathrm{diag}\left(\sqrt{\pi_r}\right)^{-1}$. What we need to show is that $\lambda_{\max}(\widetilde{A}^\top \widetilde{A}) < 1$. Expanding $\widetilde{A}$, we get

$\widetilde{A} = \mathrm{diag}\left(\sqrt{\pi_r}\right) A\, \mathrm{diag}\left(\sqrt{\pi_r}\right)^{-1} - \sqrt{\pi_r}\, \sqrt{\pi_r}^\top,$

where $\sqrt{\pi_r}$ denotes the vector whose $i$th element is $\sqrt{[\pi_r]_i}$. With the fact that $A\mathbf{1}_n = \mathbf{1}_n$ and $\pi_r^\top A = \pi_r^\top$, it can be verified that $\widetilde{A}\, \sqrt{\pi_r} = \mathbf{0}_n$ and $\sqrt{\pi_r}^\top \widetilde{A} = \mathbf{0}_n^\top$, and thus $\sqrt{\pi_r}$ lies in the null space of both $\widetilde{A}$ and $\widetilde{A}^\top$. Furthermore, $\|\sqrt{\pi_r}\|_2 = 1$, and the remaining singular values of $\mathrm{diag}(\sqrt{\pi_r})\, A\, \mathrm{diag}(\sqrt{\pi_r})^{-1}$ coincide with those of $\widetilde{A}$. Since $A$ is primitive, by the Perron-Frobenius theorem [35], we have $\lambda_{\max}(\widetilde{A}^\top \widetilde{A}) < 1$, i.e., $\sigma_A = \|A - A^\infty\|_A < 1$. The proof of (7) follows analogously.