I Introduction
In the era of data deluge, where it is particularly difficult to store and process all data on a single device, node, or processor, distributed schemes are becoming attractive for inference, learning, and optimization. Distributed optimization over multiagent systems, thus, has been of significant interest in many areas including, but not limited to, machine learning [1, 2], big-data analytics [3, 4], and distributed control [5, 6]. However, the underlying algorithms must be designed to address practical limitations and realistic scenarios. For instance, with computation and data collection/storage being pushed to the edge devices, e.g., in the Internet of Things (IoT), the data available for distributed optimization is often inexact. Moreover, the ad hoc nature of setups outside of data centers requires the algorithms to be amenable to communication protocols that are not necessarily bidirectional. The focus of this paper is to study and characterize distributed optimization schemes where the inter-agent communication is restricted to directed graphs and the information/data is inexact.
In particular, we study distributed stochastic optimization over directed graphs and propose an algorithm to minimize a sum of local cost functions. The proposed algorithm assumes access to a stochastic first-order oracle, i.e., when an agent queries the oracle, it gets an unbiased estimate of the gradient of its local cost function. In the proposed approach, each agent forms a weighted average of its own and its neighbors' solution estimates, and simultaneously incorporates its local estimate of the gradient of the global cost function. The exchange of solution estimates is performed with a row-stochastic weight matrix. In parallel, each agent maintains its own estimate of the gradient of the global cost function, by combining a weighted average of its own and its neighbors' gradient estimates with its local gradient-tracking update. The exchange of these gradient estimates is performed with a column-stochastic weight matrix. Since doubly-stochastic weights are nowhere used,
the proposed algorithm is an attractive solution that is applicable to arbitrary, strongly-connected graphs.
The main contributions of this paper are as follows: (i) We show that, by choosing a sufficiently small constant step-size, the proposed algorithm converges linearly to a neighborhood of the global minimizer. This convergence guarantee is achieved for continuously-differentiable, strongly-convex local cost functions, where each agent is assumed to have access to a stochastic first-order oracle whose gradient noise has zero mean and bounded variance.
(ii) We provide explicit expressions of the appropriate norms under which the row- and column-stochastic weight matrices contract. With the help of these norms, we develop sharp and explicit convergence arguments.
We now briefly review the literature concerning distributed and stochastic optimization. Early work on deterministic finite-sum problems includes [7, 8, 9], while work on stochastic problems can be found in [10, 11]. Recently, gradient tracking has been proposed, where the local gradient at each agent is replaced by an estimate of the global gradient [12, 13, 14, 15]. Methods for directed graphs that are based on gradient tracking [16, 17, 18, 19, 14, 15, 20, 21] rely on separate iterations for eigenvector estimation that may impede convergence. This issue was recently resolved in [22, 23], see also [24, 25, 26, 27, 28] for follow-up work, where eigenvector estimation was removed with the help of a unique approach that uses both row- and column-stochastic weights. Ref. [22] derives linear convergence for the finite-sum problem when the underlying functions are smooth and strongly-convex; however, since arbitrary norms are used in the analysis, the convergence bounds are not sharp. Recent related work on time-varying networks and other approaches can be found in [29, 30, 31, 32, 33], albeit without gradient tracking. Of significant relevance is [34], where a similar setup with gradient tracking is considered over undirected graphs. We note that the proposed algorithm generalizes [34], and that the analysis in [34] relies on a weight-matrix contraction that is not applicable here.
We now describe the rest of the paper. Section II describes the problem, assumptions, and some auxiliary results. We present the convergence analysis in Section III and the main result in Section IV. Finally, Section V provides the numerical experiments and Section VI concludes the paper.
Basic Notation:
We use lowercase bold letters for vectors and uppercase italic letters for matrices, with the standard symbols for the identity matrix and the column vector of ones. For an arbitrary vector, we write its elements, its smallest element, and its largest element with the usual subscripted notation. Inequalities involving matrices and vectors are to be interpreted componentwise. For a matrix, we denote its spectral radius and its infinite power (when the limit exists). For a primitive, row-stochastic matrix, we denote its left and right eigenvectors corresponding to the eigenvalue 1, normalized so that their inner product is 1; the right eigenvector is the column of ones. Similarly, for a primitive, column-stochastic matrix, the left eigenvector is the row of ones.
II Problem Formulation and Auxiliary Results
Consider a network of agents connected over a directed graph, where the vertex set collects the agents and the edge set is the collection of ordered pairs such that one agent can send information to another; we assume that each agent has a self-loop. The agents solve the following problem:
(1)
where each local cost function is known only to its corresponding agent. We now formalize the assumptions.
Assumption 1
Each local objective is strongly-convex, i.e., the standard strong-convexity inequality holds for all pairs of points with some positive modulus.
Under Assumption 1, the optimal solution of Problem P1 exists and is unique.
Assumption 2
Each local objective is smooth, i.e., its gradient is Lipschitz-continuous with some positive Lipschitz constant.
We make the following assumption on the agent communication graph, which guarantees the existence of a directed path from each agent to every other agent.
Assumption 3
The graph is strongly-connected.
We consider distributed iterative algorithms to solve Problem P1, where each agent is able to query a stochastic first-order oracle. At each iteration, given the current local iterate as input, the oracle at an agent returns a stochastic gradient, modeled with the help of random vectors. The stochastic gradients satisfy the following standard assumptions:
Assumption 4
The random vectors are independent of each other and, conditioned on the past, each stochastic gradient is an unbiased estimate of the corresponding local gradient with bounded variance.
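A minimal oracle satisfying Assumption 4 can be sketched as follows; the quadratic cost and Gaussian noise below are illustrative assumptions, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sfo(grad_f, x, sigma=0.1):
    """Stochastic first-order oracle: returns an unbiased estimate of
    grad_f(x) with zero-mean, bounded-variance noise (Gaussian noise
    is assumed purely for illustration)."""
    return grad_f(x) + sigma * rng.standard_normal(x.shape)

# Example local cost f(x) = 0.5||x - b||^2 with exact gradient x - b.
b = np.array([1.0, -2.0])
grad_f = lambda x: x - b

x = np.zeros(2)
calls = np.array([sfo(grad_f, x) for _ in range(20000)])
# The empirical mean of repeated oracle calls approaches the true gradient.
mean_err = np.linalg.norm(calls.mean(axis=0) - grad_f(x))
assert mean_err < 0.01
```

Averaging many oracle calls recovers the exact gradient, which is precisely the unbiasedness required by Assumption 4.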
Assumption 4 is satisfied in many scenarios, for example, when the gradient noise is independent and identically distributed (i.i.d.) with zero mean and finite second moment, while being independent of the current iterate. However, Assumption 4 also allows for general gradient-noise processes that depend on the agent and the current iterate. Finally, we work with the σ-algebra generated by the random vectors up to the current iteration.
II-A The Proposed Algorithm
We now describe the proposed algorithm to solve Problem P1. Each agent maintains two state vectors over the iterations: an estimate of the global minimizer and an estimator of the global gradient. The algorithm, initialized with arbitrary solution estimates and with each gradient tracker set to the agent's first stochastic gradient, is given by the following:
(2a)  $\mathbf{x}_{i,k+1} = \sum_{j} a_{ij}\,\mathbf{x}_{j,k} - \alpha\,\mathbf{y}_{i,k}$,
(2b)  $\mathbf{y}_{i,k+1} = \sum_{j} b_{ij}\,\mathbf{y}_{j,k} + \mathbf{g}_i(\mathbf{x}_{i,k+1}) - \mathbf{g}_i(\mathbf{x}_{i,k})$,
where the weight matrices $A=\{a_{ij}\}$ and $B=\{b_{ij}\}$ are row- and column-stochastic, respectively, and follow the graph topology, i.e., an agent assigns nonzero weights only to its in-neighbors. We next write the algorithm in a compact vector form for the sake of analysis.
(3a)  $\mathbf{x}_{k+1} = A\,\mathbf{x}_{k} - \alpha\,\mathbf{y}_{k}$,
(3b)  $\mathbf{y}_{k+1} = B\,\mathbf{y}_{k} + \mathbf{g}_{k+1} - \mathbf{g}_{k}$,
where $\mathbf{x}_k$, $\mathbf{y}_k$, and $\mathbf{g}_k$ stack the local solution estimates, gradient trackers, and stochastic gradients, respectively, and the weight matrices act blockwise on the stacked iterates.
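A toy end-to-end sketch of the updates (2a)–(2b) follows; the network weights, quadratic costs, and Gaussian noise are all assumptions chosen for illustration, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, alpha, sigma = 3, 2, 0.05, 0.01

# Row-stochastic A; B = A^T is column-stochastic (toy weights on a
# strongly-connected digraph and its reverse).
A = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
B = A.T

# Local costs f_i(x) = 0.5||x - b_i||^2; the global minimizer is mean(b_i).
b = rng.standard_normal((n, p))

def grad(X):
    """Stacked stochastic gradients: row i is the noisy gradient of f_i."""
    return (X - b) + sigma * rng.standard_normal((n, p))

x = rng.standard_normal((n, p))
g = grad(x)
y = g.copy()                        # tracker initialized with y_0 = g_0

for _ in range(2000):
    x_new = A @ x - alpha * y       # (2a): consensus mixing + descent
    g_new = grad(x_new)
    y = B @ y + g_new - g           # (2b): global-gradient tracking
    x, g = x_new, g_new

# All agents end up in a small neighborhood of the global minimizer.
err = np.linalg.norm(x - b.mean(axis=0), axis=1).max()
assert err < 0.1
```

Note that the column-stochastic mixing preserves the sum of the trackers, so each tracker follows the average of the local gradients, which is what drives every agent toward the global minimizer.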
II-B Auxiliary Results
We now provide some auxiliary results to aid the convergence analysis. We first develop explicit norms with respect to which the weight matrices contract. Since both weight matrices are primitive and stochastic, we use their Perron vectors to define two weighted inner products, one for each matrix.
These inner products are well-defined because the Perron vectors are positive, and they respectively induce weighted Euclidean norms.
We denote the matrix norms induced by these vector norms in the standard way, see [35]:
(4)  
(5) 
It can be verified that standard norm-equivalence relationships hold between the two weighted norms and the Euclidean norm.
We next establish the contraction of the weight matrices with the help of the above arguments.
Lemma 1
For the row- and column-stochastic weight matrices, we have:
(6)  
(7) 
with both contraction factors strictly less than 1.
The proof of Lemma 1 is available in the Appendix. It can further be verified that each contraction factor can be expressed in terms of the second-largest singular value of an appropriately weighted matrix.
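The contraction in Lemma 1 can be checked numerically for a toy row-stochastic matrix (an assumed example); the weighted norm is induced by the left Perron vector, and the analogous check for a column-stochastic matrix uses its right Perron vector:

```python
import numpy as np

# Toy primitive, row-stochastic matrix (assumed example).
A = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
n = A.shape[0]

# Left Perron vector of A (eigenvalue 1), normalized to sum to 1.
vals, vecs = np.linalg.eig(A.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
A_inf = np.outer(np.ones(n), pi)          # A^infty = 1 pi^T

# Matrix norm induced by the pi-weighted Euclidean norm
# ||x||_pi = sqrt(sum_i pi_i x_i^2):  ||M||_pi = ||D^{1/2} M D^{-1/2}||_2.
D_half = np.diag(np.sqrt(pi))
D_half_inv = np.diag(1.0 / np.sqrt(pi))
sigma_A = np.linalg.norm(D_half @ (A - A_inf) @ D_half_inv, 2)

# Lemma-1-style contraction: A - A^infty contracts in the pi-norm.
assert sigma_A < 1.0
# In the unweighted 2-norm the same matrix need not contract, which is
# why the weighted norms are needed for directed graphs.
```

The point of the weighted norm is exactly this: a row-stochastic matrix on a directed graph is generally not a 2-norm contraction, but it always contracts in the norm weighted by its own Perron vector.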
In the following, Lemma 2 provides some simple results on the stochastic gradients, Lemma 3 uses the smoothness of the cost functions, while Lemmas 4 and 5 are standard results in convex optimization and matrix analysis. To present these results, we define three auxiliary quantities. The statements use standard arguments and their formal proofs are omitted due to space limitations; similar results can be found in [13, 22, 34].
Lemma 5 ([35])
Let $A$ be a nonnegative matrix and $\mathbf{x}$ a positive vector. If $A\mathbf{x} \leq \omega\,\mathbf{x}$ with $\omega < 1$, then $\rho(A) \leq \omega < 1$.
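Lemma 5 can be checked numerically on a toy nonnegative matrix (an assumed example):

```python
import numpy as np

# Numerical check of Lemma 5: for a nonnegative matrix A and a positive
# vector x with A x <= omega x and omega < 1, the spectral radius of A
# satisfies rho(A) <= omega < 1.
A = np.array([[0.3, 0.2, 0.0],
              [0.1, 0.4, 0.2],
              [0.0, 0.3, 0.5]])
x = np.ones(3)                        # a positive vector

omega = max(A @ x / x)                # smallest omega with A x <= omega x
assert omega < 1.0

rho = max(abs(np.linalg.eigvals(A)))  # spectral radius of A
assert rho <= omega + 1e-12
```

With x chosen as the all-ones vector, omega is simply the largest row sum, so the lemma reduces to the familiar row-sum bound on the spectral radius; a better x can give a tighter omega.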
III Convergence Analysis
In this section, we analyze the proposed algorithm and establish its convergence properties, for which we present Lemmas 6–9. The proofs of these lemmas are provided in the Appendix. First, Lemma 6 provides a preliminary bound.
Next, in Lemmas 7–9, we bound the following three quantities in expectation, conditioned on the σ-algebra of the past: (i) the consensus error in the network; (ii) the optimality gap; and (iii) the gradient tracking error. We then show that the norm of a vector composed of these three quantities converges linearly to a ball around the optimum when the step-size is constant and sufficiently small. The first lemma below is on the consensus error.
Lemma 7
Let Assumption 3 hold. Then the consensus error in the network satisfies:
(9) 
The next lemma is on the optimality gap.
Finally, we quantify the gradient tracking error.
Lemma 9
Let the hypotheses of Lemma 2 hold. The gradient tracking error satisfies:
(11) 
IV Main Result
In this section, we combine the bounds of Lemmas 7–9 into a system of linear inequalities and analyze it to establish the convergence of the proposed algorithm.
Theorem 1
The goal is to find the range of step-sizes such that the spectral radius of the coefficient matrix is less than 1. In light of Lemma 5, it suffices to find the range of step-sizes such that the corresponding elementwise inequality holds for some positive vector. We now expand this elementwise matrix inequality as follows:
which can be reformulated as follows:
(13)  
(14)  
(15) 
We now set the entries of the positive vector as
(16) 
Then for (13) to hold, it suffices to require
(17) 
One can verify that, with the choices provided in (16), (14) holds. Lastly, for (15) to hold, we require
(18) 
Therefore, (17) and (18), together with the step-size requirement from Lemma 8, complete the proof. It is important to note that the error bounds in Theorem 1 go to zero as the step-size gets smaller and the variance of the gradient noise decreases.
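The step-size dependence of the error neighborhood can be observed on a toy instance of the two-matrix gradient-tracking update (all problem data below are assumptions for illustration):

```python
import numpy as np

def run(alpha, iters=3000, sigma=0.05, seed=0):
    """One run of the row/column-stochastic gradient-tracking updates on
    a toy quadratic problem; returns the final distance to the minimizer."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.4, 0.0, 0.6]])     # row-stochastic (toy weights)
    B = A.T                             # column-stochastic
    b = rng.standard_normal((3, 2))     # f_i(x) = 0.5||x - b_i||^2
    grad = lambda X: (X - b) + sigma * rng.standard_normal((3, 2))
    x = np.zeros((3, 2))
    g = grad(x)
    y = g.copy()                        # y_0 = g_0
    for _ in range(iters):
        x, g_old = A @ x - alpha * y, g
        g = grad(x)
        y = B @ y + g - g_old
    return np.linalg.norm(x - b.mean(axis=0))

# Average the final error over a few seeds for each step-size.
err_small = np.mean([run(0.005, seed=s) for s in range(5)])
err_large = np.mean([run(0.1, seed=s) for s in range(5)])
assert err_small < err_large   # smaller step-size => smaller neighborhood
```

This mirrors the remark above: the iterates converge to a noise ball whose radius shrinks with the step-size, at the price of slower transient convergence.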
V Numerical Experiments
In this section, we illustrate the proposed algorithm and its convergence properties. We demonstrate the results on a directed graph generated using nearest-neighbor rules. The particular graph for the experiments is shown in Fig. 1 (left) to provide a sense of connectivity. We choose a logistic regression problem to classify images of two digits, with binary labels, from the MNIST dataset [37]. Each image is represented as a vector, and the total set of images is divided among the agents so that each agent holds a local batch. Because of privacy and communication restrictions, the agents do not share their local batches (local training images) with each other. In order to use the entire dataset for training, the network of agents cooperatively solves a distributed logistic regression problem, where the private function at each agent is the logistic loss over its local training samples.
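As a sketch, assuming a standard ℓ2-regularized logistic loss (the paper's exact per-agent cost is not reproduced above and may differ), each agent's local gradient and its single-sample oracle call can be written as:

```python
import numpy as np

rng = np.random.default_rng(3)

def local_logistic_grad(x, C, y, lam=0.01):
    """Gradient of an assumed regularized logistic loss
    f_i(x) = (1/m) sum_s log(1 + exp(-y_s c_s^T x)) + (lam/2)||x||^2."""
    s = 1.0 / (1.0 + np.exp(y * (C @ x)))          # sigmoid(-y_s c_s^T x)
    return -(C * (y * s)[:, None]).mean(axis=0) + lam * x

def sfo_call(x, C, y, lam=0.01):
    """Oracle query: gradient evaluated on one uniformly sampled point."""
    s = rng.integers(len(y))
    return local_logistic_grad(x, C[s:s + 1], y[s:s + 1], lam)

# Tiny synthetic stand-in for one agent's local batch of labeled images.
C = rng.standard_normal((50, 10))      # 50 samples, 10 features
y = np.sign(rng.standard_normal(50))   # binary labels in {-1, +1}
x0 = np.zeros(10)

g_full = local_logistic_grad(x0, C, y)
g_est = np.mean([sfo_call(x0, C, y) for _ in range(5000)], axis=0)
diff = np.linalg.norm(g_est - g_full)
assert diff < 0.2    # single-sample gradients are unbiased estimates
```

Uniform sampling from the local batch makes the single-sample gradient an unbiased estimate of the full local-batch gradient, matching the oracle model of Assumption 4.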
We compare the performance on this classification problem across centralized and distributed methods. Centralized gradient descent (CGD) uses the entire batch, i.e., it computes all gradients at each iteration, whereas centralized stochastic gradient descent (CSGD) uses only one data point at each iteration, uniformly sampled from the entire batch. For the distributed algorithms, we show the performance of the non-stochastic variant, where each agent uses its entire local batch of labeled data points, and of the stochastic implementation, where each agent uniformly chooses one data point from its local batch. For testing, we use additional images that were not used for training. The residuals are shown in Fig. 1 (right), while the training and testing accuracies are shown in Fig. 2. In the performance figures, the horizontal axis represents the number of epochs, where each epoch represents computations on the entire batch. Clearly,
the stochastic implementation has better performance when compared to the method in [22], as expected from the performance of their centralized counterparts, CSGD and CGD.
VI Conclusions
In this paper, we have presented a distributed stochastic gradient descent algorithm applicable over arbitrary strongly-connected graphs. In this setup, the data is distributed over the agents, and each agent uniformly samples a data point (from its local batch) at each iteration of the algorithm. To cope with general directed communication graphs and the potential lack of doubly-stochastic weights, the proposed algorithm employs a two-phase update with row- and column-stochastic weights. We have shown that, with a sufficiently small constant step-size, the algorithm converges linearly to a neighborhood of the global minimizer when the local cost functions are smooth and strongly-convex. We have presented numerical simulations based on real-world datasets to illustrate the theoretical results.
References
 [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communicationefficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
 [2] H. Raja and W. U. Bajwa, “Cloud KSVD: A collaborative dictionary learning algorithm for big, distributed data,” IEEE Trans. Signal Processing, vol. 64, no. 1, pp. 173–188, Jan. 2016.
 [3] A. Daneshmand, F. Facchinei, V. Kungurtsev, and G. Scutari, “Hybrid random/deterministic parallel algorithms for convex and nonconvex big data optimization,” IEEE Trans. on Signal Processing, vol. 63, no. 15, pp. 3914–3929, 2015.
 [4] D. Jakovetić, J. Xavier, and J. M. F. Moura, “Fast distributed gradient methods,” IEEE Trans. on Automatic Control, vol. 59, no. 5, pp. 1131–1146, May 2014.
 [5] F. Bullo, J. Cortés, and S. Martinez, Distributed control of robotic networks: A mathematical approach to motion coordination algorithms, Princeton University Press, 2009.
 [6] S. Lee and M. M. Zavlanos, “Approximate projection methods for decentralized optimization with functional constraints,” IEEE Trans. on Automatic Control, 2017.
 [7] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multiagent optimization,” IEEE Trans. on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
 [8] I. Lobel, A. Ozdaglar, and D. Feijer, “Distributed multiagent optimization with statedependent communication,” Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
 [9] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Trans. on Automatic Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.
 [10] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of optimization theory and applications, vol. 147, no. 3, pp. 516–545, 2010.
 [11] A. Nedić and A. Olshevsky, “Stochastic gradientpush for strongly convex functions on timevarying directed graphs,” IEEE Trans. on Automatic Control, vol. 61, no. 12, pp. 3936–3947, Dec. 2016.
 [12] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradient methods for multiagent optimization under uncoordinated constant stepsizes,” in IEEE 54th Annual Conference on Decision and Control, 2015, pp. 2055–2060.
 [13] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. on Control of Network Systems, Apr. 2017.
 [14] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over timevarying graphs,” SIAM Journal of Optimization, vol. 27, no. 4, pp. 2597–2633, Dec. 2017.
 [15] C. Xi, R. Xin, and U. A. Khan, “ADDOPT: Accelerated distributed directed optimization,” IEEE Trans. on Automatic Control, Aug. 2017, in press.
 [16] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Pushsum distributed dual averaging for convex optimization,” in 51st IEEE Annual Conference on Decision and Control, Maui, Hawaii, Dec. 2012, pp. 5453–5458.
 [17] A. Nedić and A. Olshevsky, “Distributed optimization over timevarying directed graphs,” IEEE Trans. on Automatic Control, vol. 60, no. 3, pp. 601–615, Mar. 2015.
 [18] C. Xi, Q. Wu, and U. A. Khan, “On the distributed optimization over directed networks,” Neurocomputing, vol. 267, pp. 508–515, Dec. 2017.
 [19] C. Xi and U. A. Khan, “Distributed subgradient projection algorithm over directed graphs,” IEEE Trans. on Automatic Control, vol. 62, no. 8, pp. 3986–3992, Oct. 2016.
 [20] C. Xi, V. S. Mai, R. Xin, E. Abed, and U. A. Khan, “Linear convergence in optimization over directed graphs with rowstochastic matrices,” IEEE Trans. on Automatic Control, Jan. 2018, in press.
 [21] R. Xin, C. Xi, and U. A. Khan, “FROST – Fast rowstochastic optimization with uncoordinated stepsizes,” Arxiv: https://arxiv.org/abs/1803.09169, Mar. 2018.
 [22] R. Xin and U. A. Khan, “A linear algorithm for optimization over directed graphs with geometric convergence,” IEEE Control Systems Letters, vol. 2, no. 3, pp. 325–330, Jul. 2018.
 [23] S. Pu, W. Shi, J. Xu, and A. Nedić, “A pushpull gradient method for distributed optimization in networks,” in 57th IEEE Annual Conference on Decision and Control, Dec. 2018.
 [24] R. Xin and U. A. Khan, “Distributed heavyball: A generalization and acceleration of firstorder methods with gradient tracking,” arXiv preprint arXiv:1808.02942, 2018.
 [25] S. Pu, W. Shi, J. Xu, and A. Nedić, “Pushpull gradient methods for distributed optimization in networks,” https://arxiv.org/abs/1810.06653, 2018.
 [26] F. Saadatniaki, R. Xin, and U. A. Khan, “Optimization over timevarying directed graphs with row and columnstochastic matrices,” arXiv preprint arXiv:1810.07393, 2018.
 [27] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Secondorder guarantees of distributed gradient algorithms,” arXiv preprint arXiv:1809.08694, 2018.
 [28] R. Xin, D. Jakovetic, and U. A. Khan, “Distributed Nesterov gradient methods over arbitrary graphs,” IEEE Signal Processing Letters, Jan. 2019, Arxiv: 1901.06995.
 [29] D. Yuan, Y. Hong, D. W. C. Ho, and G. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, Apr. 2018.
 [30] N. D. Vanli, M. O. Sayin, and S. S. Kozat, “Stochastic subgradient algorithms for strongly convex optimization over distributed networks,” IEEE Trans. on Network Science and Engineering, vol. 4, no. 4, pp. 248–260, Oct. 2017.
 [31] D. Jakovetić, D. Bajović, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in IEEE Conference on Decision and Control, Dec. 2018, pp. 4238–4245.
 [32] A. K. Sahu, D. Jakovetić, D. Bajović, and S. Kar, “Distributed zeroth order optimization over random networks: A KieferWolfowitz stochastic approximation approach,” in Conference on Decision and Control, Dec. 2018, pp. 4951–4958.
 [33] D. Jakovetic, “A unification and generalization of exact distributed first order methods,” IEEE Trans. on Signal and Information Processing over Networks, 2018.
 [34] S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in 2018 IEEE Conference on Decision and Control (CDC), Dec. 2018, pp. 963–968.
 [35] R. A. Horn and C. R. Johnson, Matrix Analysis, 2 ed., Cambridge University Press, New York, NY, 2013.
 [36] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
 [37] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, pp. 18, 2010.
Proof of Lemma 1
We start with the proof of (6). Note that $A\,A_\infty = A_\infty A = A_\infty$, which leads to
By the definition of the induced norm in Eq. (4), we have
where the quantity of interest is the largest eigenvalue of the corresponding matrix. What we need to show is that this eigenvalue is strictly less than 1. Expanding the product, we get
With the fact that the weight matrix is stochastic, the required intermediate identities can be verified. Furthermore, since the matrix is primitive, by the Perron–Frobenius theorem [35], we have