1 Introduction: Distributed Optimization and Its Limitations
First-order optimization methods, ranging from vanilla gradient descent to Nesterov acceleration and its many variants, have emerged over the past decade as the principal way to train Machine Learning (ML) models. There is a great need for techniques to train such models quickly and reliably in a distributed fashion over networks where the individual processors or GPUs may be scattered across the globe and communicate over an unreliable network which may suffer from message losses, delays, and asynchrony (see [1, 2, 3, 4, 5]).
Unfortunately, what often happens is that the gains from having many different processors running an optimization algorithm are squandered by the cost of coordination, shared memory, message losses and latency. This effect is especially pronounced when there are many processors and they are spread across geographically distributed data centers. As is widely recognized by the distributed systems community, “throwing” more processors at a problem will not, after a certain point, result in better performance.
This is typically reflected in the convergence time bounds obtained for distributed optimization in the literature. The problem formulation is that one must solve
$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \qquad (1)$$
over a network of $n$ nodes (see Figure 1 for an example).
Only node $i$ has knowledge of the function $f_i$, and the standard assumption is that, at every step when it is awake, node $i$ can compute the gradient of its own local function $f_i$. These functions are assumed to be convex. The problem is to compute this minimum in a distributed manner over the network, relying only on peer-to-peer communication and coping with possible message losses, delays, and asynchrony.
This relatively simple formulation captures a large variety of learning problems. Suppose each agent $i$ stores $J$ training data points $(u_{i,j}, v_{i,j})$, $j = 1, \ldots, J$, where the $u_{i,j}$ are vectors of features and the $v_{i,j}$ are the associated responses (either discrete or continuous). We are interested in learning a predictive model $g(u; x)$, parameterized by a parameter vector $x$, so that $g(u_{i,j}; x) \approx v_{i,j}$ for all $i, j$. In other words, we are looking for a model that fits all the data throughout the network. This can be accomplished by empirical risk minimization: in (1), take
$$f_i(x) := \frac{1}{J} \sum_{j=1}^{J} \ell\big( g(u_{i,j}; x), v_{i,j} \big), \qquad (2)$$
which measures how well the parameter $x$ fits the data at node $i$, with $\ell(\cdot, \cdot)$ being a loss function measuring the difference between $g(u_{i,j}; x)$ and $v_{i,j}$. Much of modern machine learning is built around such a formulation, including regression, classification, and regularized variants.
It is also possible that each agent $i$ does not have a static dataset, but instead collects streaming data points $(u_i, v_i) \sim \mathcal{D}_i$ repetitively over time, where $\mathcal{D}_i$ represents an unknown distribution of the data. In this case we can find $x$ through expected risk minimization
$$f_i(x) := \mathbb{E}_{(u_i, v_i) \sim \mathcal{D}_i} \big[ \ell\big( g(u_i; x), v_i \big) \big]. \qquad (3)$$
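To make the empirical risk formulation concrete, here is a minimal Python sketch of a network-wide empirical risk; the linear model, the squared loss, and all variable names are illustrative choices of ours, not notation fixed by the text:

```python
import numpy as np

def local_empirical_risk(x, U, V):
    """f_i(x): average squared loss of the linear model u -> u @ x on agent i's data."""
    return float(np.mean((U @ x - V) ** 2))

def global_risk(x, data):
    """f(x) = (1/n) * sum_i f_i(x): the objective shared by the whole network."""
    return float(np.mean([local_empirical_risk(x, U, V) for U, V in data]))

rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0])
# n = 3 agents, each holding J = 50 local samples from the same linear model
data = []
for _ in range(3):
    U = rng.normal(size=(50, 2))
    data.append((U, U @ x_true))

# on noiseless data, the generating parameter drives the empirical risk to zero
print(global_risk(x_true, data))
```

A model that fits every agent's data makes each local risk small simultaneously, which is exactly what minimizing the average objective in (1) asks for.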
This paper is concerned with the current limitations of distributed optimization and how to get past them. To illustrate our main concern, let us consider the distributed subgradient method in the simplest possible setting, namely the problem of computing the median of a collection of numbers in a distributed manner over a fixed graph. Each agent $i$ in the network holds a value $a_i$, and the global objective is to find the median of $a_1, \ldots, a_n$. This can be incorporated in the framework of (1) by choosing
$$f_i(x) := |x - a_i|,$$
since the sum of these functions is minimized at the median. The distributed subgradient method then has each agent update
$$x_i(k+1) = \sum_{j=1}^{n} w_{ij} x_j(k) - \alpha_k \, \mathrm{sgn}\big( x_i(k) - a_i \big), \qquad (4)$$
where $\alpha_k$ denotes the stepsize at iteration $k$, and $w_{ij}$ is the weight agent $i$ assigns to agent $j$'s solution: two agents $i$ and $j$ are able to exchange information if and only if $w_{ij} > 0$ ($w_{ij} = 0$ otherwise). The weights are assumed to be symmetric. For comparison, the centralized subgradient method updates the solution at iteration $k$ according to
$$x(k+1) = x(k) - \frac{\alpha_k}{n} \sum_{i=1}^{n} \mathrm{sgn}\big( x(k) - a_i \big).$$
In Figure 2, we show the performance of Algorithm (4) as a function of the network size assuming the agents communicate over a ring network. As can be clearly seen, when the network size grows it takes a longer time for the algorithm to reach a certain performance threshold.
Clearly, this is an undesirable property. Glancing at the figure, we see that distributing computation over $n$ nodes can result in a convergence time that grows polynomially with $n$. Few practitioners will be enthusiastic about distributed optimization if the final effect is a vastly increased convergence time.
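The experiment behind this discussion is easy to reproduce in spirit. The sketch below runs the distributed subgradient method for the median problem on a small ring; the mixing weight $1/3$, the stepsize $1/\sqrt{k}$, and the values $1, \ldots, 9$ are our own illustrative choices:

```python
import numpy as np

def ring_weights(n):
    # symmetric mixing matrix for a ring: weight 1/3 to self and to each neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def distributed_median(values, num_iters):
    """Distributed subgradient method for f_i(x) = |x - a_i| over a ring."""
    n = len(values)
    W = ring_weights(n)
    x = np.zeros(n)                            # one scalar estimate per agent
    for k in range(1, num_iters + 1):
        subgrad = np.sign(x - values)          # subgradient of |x - a_i| at x_i
        x = W @ x - subgrad / np.sqrt(k)       # mix with neighbors, then step
    return x

values = np.arange(1.0, 10.0)                  # 1, 2, ..., 9: the median is 5
estimates = distributed_median(values, 5000)
print(np.max(np.abs(estimates - 5.0)))         # every agent approaches the median
```

Rerunning `distributed_median` with larger rings (and a matching list of values) reproduces the slowdown: the larger the ring, the more iterations are needed to reach the same accuracy.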
One might hope that this phenomenon, demonstrated for the problem of median computation – considered here because it is arguably the simplest problem to which one can apply the subgradient method – will not hold for the more sophisticated optimization problems in the ML literature. Unfortunately, most work in distributed optimization replicates this undesirable phenomenon. We next give an extremely brief discussion of known convergence times in the distributed setting (for a much more extended discussion, we refer the reader to the recent survey [9]).
We would like to confine our discussion to the following point: most known convergence times in the distributed optimization literature imply bounds of the form
$$T_{\mathrm{dec}}(n, \epsilon) \leq h(n) \, T_{\mathrm{cent}}(n, \epsilon), \qquad (6)$$
where $T_{\mathrm{dec}}(n, \epsilon)$ denotes the time for the decentralized algorithm on $n$ nodes to reach accuracy $\epsilon$ (error at most $\epsilon$), and $T_{\mathrm{cent}}(n, \epsilon)$ is the time for the centralized algorithm which can query $n$ gradients per time step to reach the same level of accuracy. The function $h(n)$ can usually be bounded in terms of some polynomial in the number of nodes $n$.
For instance, for the subgradient method, Corollary 9 of [8] gives a bound of the form
$$T_{\mathrm{dec}}(n, \epsilon) = O\!\left( h(n) \, \frac{\max_i \|x_i(0) - x^*\|^2 \, L^2}{\epsilon^2} \right),$$
where the $x_i(0)$ are initial estimates, $x^*$ denotes the optimal solution, and $L$ bounds the $\ell_2$-norm of the subgradients. The function $h(n)$ is the inverse of the spectral gap corresponding to the graph, and will typically grow with $n$; hence when $n$ is large, so is $h(n)$. In particular, for common graph families such as path graphs, star graphs, and geometric random graphs, $h(n)$ grows polynomially in $n$ (quadratically, for example, in the case of path graphs). On well-connected graphs $h(n)$ can be as small as a constant, but typically it is at least linear in $n$.
By comparing $T_{\mathrm{dec}}(n, \epsilon)$ and $T_{\mathrm{cent}}(n, \epsilon)$, we are keeping the computational power the same in both cases. Naturally, the centralized method is always better: anything that can be done in a decentralized way can also be done in a centralized way. The question, though, is how much better.
Framed in this way, the polynomial scaling of the quantity $h(n)$ is extremely disconcerting. It is hard, for example, to argue that an algorithm should be run in a distributed manner with, say, $n = 100$ nodes if the quantity $h(n)$ in Eq. (6) satisfies $h(n) = n^2$; that would imply the distributed variant would be $10{,}000$ times slower than the centralized one with the same computational power.
Sometimes $h(n)$ is written as the inverse spectral gap
$$h(n) = \frac{1}{1 - \lambda_2},$$
in terms of the second-largest eigenvalue $\lambda_2$ of some matrix associated with the network. Because the second-smallest eigenvalue of an undirected graph Laplacian can be as little as order $1/n^2$ away from zero, such bounds will translate into at least quadratic scalings with $n$ in the worst case. Over time-varying $B$-connected graphs, the best-known bounds on $h(n)$ will be cubic in $n$, using the results of [7].
There are a number of caveats to the pessimistic argument outlined above. For example, in a multi-agent scenario where data sharing is not desirable or feasible, decentralized computation might be the only available option. Generally speaking, however, a fast-growing $h(n)$ will preclude the widespread applicability of distributed optimization. Indeed, returning to the back-of-the-envelope calculation above, if a user has to pay a multiplicative factor of 10,000 in convergence speed to use an algorithm, the most likely scenario is that the algorithm will not be used.
There is one scenario which avoids the pessimistic discussion above: when the underlying graph is an expander, the associated spectral gap is constant (see Chapter 6 of [10] for a definition of these terms as well as an explanation). In particular, on an Erdős–Rényi random graph, the quantity $h(n)$ is constant with high probability (Corollary 9, part 9 in [8]). Unfortunately, this is a very special case which will not occur in geographically distributed systems. By way of comparison, a random graph where nodes are associated with random locations, with links between nodes close together, will not have a constant spectral gap and will thus have an $h(n)$ that grows with $n$ (Corollary 9, part 10 of [8]). The Erdős–Rényi graph escapes this because, if we again associate nodes with locations, the average link in the E-R graph is a “long range” one connecting nodes that are geographically far apart. By contrast, graphs built on geographic nearest-neighbor communications will not have constant spectral gaps.
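The contrast between geometric and expander-like graphs is easy to check numerically. The sketch below compares the spectral gap of a ring with that of an Erdős–Rényi graph under lazy Metropolis weights; the sizes, the seed, and the edge probability $3 \log(n)/n$ (chosen so the random graph is connected with high probability) are our own illustrative choices:

```python
import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis mixing matrix for an undirected 0/1 adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (2.0 * max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def spectral_gap(W):
    lams = np.linalg.eigvalsh(W)      # ascending order; W is symmetric
    return 1.0 - lams[-2]             # 1 minus the second-largest eigenvalue

def ring_adj(n):
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    return A

n = 60
rng = np.random.default_rng(1)
p = 3.0 * np.log(n) / n               # connected with high probability
A = (rng.random((n, n)) < p).astype(int)
A = np.triu(A, 1)
A = A + A.T                           # symmetric Erdos-Renyi adjacency

gap_ring = spectral_gap(lazy_metropolis(ring_adj(n)))
gap_er = spectral_gap(lazy_metropolis(A))
print(gap_ring, gap_er)               # the ring's gap is far smaller
```

Since $h(n)$ is the inverse of this gap, the ring's tiny gap translates directly into the large slowdown factors discussed above.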
2 Asymptotic Network Independence in Distributed Stochastic Optimization
In this paper, we provide a discussion of several recent papers which have shown that, in a number of settings, $T_{\mathrm{dec}}(n, \epsilon) \approx T_{\mathrm{cent}}(n, \epsilon)$ as long as the number of iterations is large enough. In other words, asymptotically, the distributed algorithm performs as well as a centralized algorithm with the same computational power.
We call this property asymptotic network independence: it is as if the network is not even there. Asymptotic network independence provides an answer to the concerns raised in the previous section.
Figure 3 illustrates this property on the problem of classification with a smooth support vector machine between overlapping clusters of points. The performance of the centralized algorithm is shown in orange, and the performance of the decentralized algorithm is shown in dark blue; the light blue line shows the disagreement among nodes. The graph is a ring of 50 nodes. The figure illustrates the main result, which is that a network of 50 nodes performs as well in the limit as a centralized method with 50x the computational power of one node. Indeed, after sufficiently many iterations the orange and dark blue lines are almost indistinguishable.
We mention that similar simulations are available for other machine learning methods (training neural networks, logistic regression, elastic net regression, etc.). The asymptotic network independence property enables us to efficiently distribute the training process for a variety of existing learning methods.
The name “asymptotic network independence” is a slight misnomer, as we actually do not care if the asymptotic performance depends in some complicated way on the network. All we want is that the decentralized performance can be bounded by a constant (independent of the network size) times the performance of the centralized method.
These results were developed in the papers [12, 13, 14, 11, 15] in the setting of distributed optimization of strongly convex functions in the presence of noise. The very first paper [12] contained the basic idea. By assuming a sufficiently small constant stepsize, it approximated the distributed stochastic gradient method by a stochastic differential equation in continuous time. The work showed that the distributed method outperforms a centralized scheme with synchronization overhead. However, it did not lead to straightforward algorithmic bounds. The paper [13] gave the first crisp statement of the relationship between centralized and distributed methods by means of a central limit theorem. It considered a general stochastic approximation setting which is not limited to strongly convex optimization; the proof proceeded based on certain technical properties of stochastic approximation methods. In our recent work [11, 15], we generalized the results to graphs which are time-varying, with delays, message losses, and asynchrony. In a parallel recent work [14], a similar result was demonstrated with a further compression technique which allowed nodes to save on communication.
When the objective functions are not assumed to be convex, several recent works have obtained asymptotic network independence for distributed stochastic gradient descent. The work in [16] was the first to show that distributed algorithms could achieve a speedup like a centralized method when the number of computing steps is large enough. Such a result was generalized to the setting of directed communication networks in [4] for training deep neural networks, where the push-sum technique was combined with the standard distributed stochastic gradient scheme.
In the rest of this section, we will give a simple and readable explanation of the asymptotic network independence phenomenon in the context of distributed stochastic optimization over smooth and strongly convex objective functions. (For more references on the topic of distributed stochastic optimization, the reader may refer to [17, 18, 19, 20, 21, 22, 23, 24, 25].)
2.1 Problem Formulation and Assumptions
We are interested in minimizing Eq. (1) over a network of $n$ communicating agents. Regarding the objective functions, we make the following standing assumption.
Assumption 1. Each $f_i$ is $\mu$-strongly convex with $L$-Lipschitz continuous gradients, i.e., for any $x, x' \in \mathbb{R}^d$,
$$\langle \nabla f_i(x) - \nabla f_i(x'), \, x - x' \rangle \geq \mu \|x - x'\|^2, \qquad \|\nabla f_i(x) - \nabla f_i(x')\| \leq L \|x - x'\|.$$
Under this assumption, the average function $f = \frac{1}{n} \sum_{i=1}^{n} f_i$ has a unique minimizer, denoted by $x^*$, and the gradient descent mapping $x \mapsto x - \alpha \nabla f(x)$ has the following contraction property (see Lemma 10 of [15]). For any $x \in \mathbb{R}^d$ and $\alpha \in (0, 1/L]$, we have
$$\| x - \alpha \nabla f(x) - x^* \| \leq (1 - \alpha \mu) \| x - x^* \|.$$
In other words, gradient descent with a small stepsize reduces the distance between the current solution and $x^*$.
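This contraction is easy to verify numerically on a strongly convex quadratic, where the gradient map is linear; the specific matrix, stepsize, and sample points below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L = 0.5, 4.0
# f(x) = 0.5 * x^T A x is mu-strongly convex with L-Lipschitz gradient
A = np.diag(np.linspace(mu, L, 5))
x_star = np.zeros(5)                 # unique minimizer of f
grad = lambda x: A @ x

alpha = 1.0 / L                      # stepsize in (0, 1/L]
for _ in range(100):
    x = rng.normal(size=5)
    before = np.linalg.norm(x - x_star)
    after = np.linalg.norm(x - alpha * grad(x) - x_star)
    # distance to the minimizer contracts by at least a (1 - alpha * mu) factor
    assert after <= (1.0 - alpha * mu) * before + 1e-12
print("contraction verified on 100 random points")
```

For this quadratic the gradient step multiplies the $j$-th error coordinate by $1 - \alpha \lambda_j$, and all these factors lie in $[0, 1 - \alpha\mu]$ when $\alpha \leq 1/L$, which is exactly the contraction asserted above.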
In the stochastic optimization setting, we assume each agent $i$ is able to obtain noisy gradient estimates $g_i(x, \xi_i)$ that satisfy the following condition.

Assumption 2. For all $i$ and $x \in \mathbb{R}^d$, each random vector $\xi_i$ is independent, and
$$\mathbb{E}\left[ g_i(x, \xi_i) \mid x \right] = \nabla f_i(x), \qquad \mathbb{E}\left[ \| g_i(x, \xi_i) - \nabla f_i(x) \|^2 \mid x \right] \leq \sigma^2 \ \text{ for some } \sigma > 0.$$
This assumption is satisfied in many distributed learning problems. For instance, in empirical risk minimization (2), the gradient estimation of $\nabla f_i$ can introduce noise from various sources, such as approximation and discretization errors. For another example, when minimizing the expected risk in (3), where independent data points $(u_i, v_i)$ are gathered over time, the gradient of the loss evaluated at a freshly sampled data point is an unbiased estimator of $\nabla f_i(x)$ satisfying Assumption 2.
The algorithm we discuss is the standard Distributed Stochastic Gradient Descent (DSGD) method. We let each agent $i$ in the network hold a local copy of the decision vector, denoted by $x_i$, whose value at iteration/time $k$ is written as $x_i(k)$. Denote $g_i(k) := g_i(x_i(k), \xi_i(k))$ for short. At each step $k$, every agent $i$ performs the following update:
$$x_i(k+1) = \sum_{j=1}^{n} w_{ij} \big( x_j(k) - \alpha_k \, g_j(k) \big), \qquad (9)$$
where $\{\alpha_k\}$ is a sequence of nonnegative, non-increasing stepsizes. The initial vectors $x_i(0)$ are arbitrary for all $i$, and $W = [w_{ij}] \in \mathbb{R}^{n \times n}$ is a mixing matrix.
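A minimal sketch of a DSGD iteration on a toy strongly convex problem is given below; the quadratic objectives, the ring weights, the noise level, and the stepsize $1/k$ are our own illustrative choices, and the update follows the "local step, then mix" form used here:

```python
import numpy as np

def dsgd(W, grads, x0, noise_std, num_iters, rng):
    """DSGD: x_i <- sum_j w_ij (x_j - alpha_k g_j), with g_j a noisy gradient."""
    x = x0.copy()                                  # shape (n, d): one row per agent
    n, d = x.shape
    for k in range(1, num_iters + 1):
        alpha = 1.0 / k
        g = np.stack([grads[i](x[i]) for i in range(n)])
        g += noise_std * rng.normal(size=(n, d))   # zero-mean noise (Assumption 2)
        x = W @ (x - alpha * g)                    # local step, then mix with neighbors
    return x

rng = np.random.default_rng(0)
n, d = 8, 2
# toy problem: f_i(x) = 0.5 * ||x - c_i||^2, so the minimizer is the mean of the c_i
c = rng.normal(size=(n, d))
grads = [lambda x, ci=ci: x - ci for ci in c]

# ring mixing matrix: weight 1/3 to self and to each neighbor
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

x_final = dsgd(W, grads, np.zeros((n, d)), 0.1, 2000, rng)
x_star = c.mean(axis=0)
print(np.max(np.abs(x_final - x_star)))            # every agent is close to x*
```

Each agent only ever touches its own gradient and its neighbors' iterates, yet all agents approach the global minimizer.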
DSGD belongs to the class of so-called consensus-based distributed optimization methods, in which the agents mix their estimates at each iteration so as to reach a consensus on the solution, i.e., $x_i(k) \approx x_j(k)$ for all $i$ and $j$ in the long run. To achieve consensus, the following condition is assumed on the mixing matrix and the communication topology among agents.

Assumption 3. The graph of agents is undirected and connected (there exists a path between any two agents). The mixing matrix $W$ is nonnegative, symmetric and doubly stochastic, i.e., $W \mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\top W = \mathbf{1}^\top$, where $\mathbf{1}$ is the all-one vector. In addition, $w_{ii} \geq \eta$ for some $\eta > 0$.
Some examples of undirected connected graphs are presented in Figure 4 below.
Because of Assumption 3, the mixing matrix $W$ has an important contraction property. Let Assumption 3 hold, and let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ denote the eigenvalues of the matrix $W$. Then $\lambda_1 = 1$, and
$$\| W x - \bar{x} \mathbf{1} \| \leq \rho_w \, \| x - \bar{x} \mathbf{1} \|, \qquad \bar{x} := \frac{1}{n} \mathbf{1}^\top x,$$
for all $x \in \mathbb{R}^n$, where $\rho_w := \max(|\lambda_2|, |\lambda_n|) < 1$.
As a result, when running a consensus algorithm (which is just (9) without the gradient step),
$$x_i(k+1) = \sum_{j=1}^{n} w_{ij} x_j(k), \qquad (10)$$
the speed of reaching consensus is determined by $\rho_w$. In particular, if we adopt the so-called lazy Metropolis rule for defining the weights, the dependency of $\rho_w$ on the network size satisfies $1 - \rho_w \geq c / n^2$ for some constant $c > 0$.
Lazy Metropolis rule for constructing $W$:
$$w_{ij} = \begin{cases} \dfrac{1}{2 \max(d_i, d_j)} & \text{if } j \text{ is a neighbor of } i, \\[4pt] 1 - \sum_{j' \neq i} w_{ij'} & \text{if } j = i, \\[4pt] 0 & \text{otherwise}, \end{cases}$$
where $d_i$ denotes the degree of node $i$.
Despite the fact that $\rho_w$ may be very close to $1$ when $n$ is large, the consensus algorithm (10) enjoys a geometric convergence speed, i.e.,
$$\| x(k) - \bar{x}(0) \mathbf{1} \| \leq \rho_w^{\,k} \, \| x(0) - \bar{x}(0) \mathbf{1} \|,$$
where $x(k) := (x_1(k), \ldots, x_n(k))^\top$ and $\bar{x}(0)$ is the average of the initial values, which is preserved by the iteration.
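The geometric decay of the consensus error, at a rate governed by $\rho_w$, can be verified directly; the ring of 30 nodes and the 200 iterations below are illustrative choices:

```python
import numpy as np

n = 30
# lazy Metropolis weights on a ring: every degree is 2, so each neighbor gets 1/4
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
    W[i, i] = 0.5

lams = np.linalg.eigvalsh(W)          # ascending order; W is symmetric
rho = lams[-2]                        # here all eigenvalues are nonnegative

rng = np.random.default_rng(0)
x = rng.normal(size=n)
xbar = x.mean()                       # consensus value, preserved by W
err0 = np.linalg.norm(x - xbar)
for _ in range(200):
    x = W @ x                         # one round of local averaging
err = np.linalg.norm(x - x.mean())
print(err, rho ** 200 * err0)         # geometric decay: err <= rho^k * err0
```

Even though `rho` is close to 1 for this ring, the error bound $\rho_w^k$ still shrinks geometrically in $k$, which is the key fact exploited below.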
By contrast, the optimal rate of convergence for any stochastic gradient method is sublinear asymptotically (see [27, 28]). This difference suggests that a consensus-based distributed algorithm for stochastic optimization may match centralized methods in the long term: any errors due to consensus will decay at a fast enough rate that they ultimately do not matter.
In what follows, we discuss and compare the performance of the centralized stochastic gradient descent (SGD) method and DSGD. We will show that both methods asymptotically converge at the rate $O(1/(nk))$ in expected squared distance to $x^*$. Furthermore, the time needed for DSGD to approach this asymptotic convergence rate turns out to scale polynomially with the inverse spectral gap $1/(1 - \rho_w)$, and hence with the network size.
2.2 Centralized Stochastic Gradient Descent (SGD)
The benchmark for evaluating the performance of DSGD is the centralized stochastic gradient descent (SGD) method, which we now describe. At each iteration $k$, the following update is executed:
$$x(k+1) = x(k) - \frac{\alpha_k}{n} \sum_{i=1}^{n} g_i(x(k), \xi_i(k)),$$
where the stepsizes $\alpha_k$ are positive, non-increasing, and satisfy $\alpha_k \to 0$ and $\sum_k \alpha_k = \infty$ (e.g., $\alpha_k = \Theta(1/k)$); i.e., the update direction is the average of $n$ noisy gradients evaluated at $x(k)$ (by utilizing $n$ gradients at each iteration, we are keeping the computational power the same for SGD and DSGD). As a result, the gradient estimation is more accurate than using just one gradient. Indeed, from Assumption 2 we have
$$\mathbb{E}\left[ \left\| \frac{1}{n} \sum_{i=1}^{n} \big( g_i(x(k), \xi_i(k)) - \nabla f_i(x(k)) \big) \right\|^2 \,\Bigm|\, x(k) \right] \leq \frac{\sigma^2}{n}.$$
Combining this bound with the contraction property above yields the recursion
$$\mathbb{E}\|x(k+1) - x^*\|^2 \leq (1 - \alpha_k \mu) \, \mathbb{E}\|x(k) - x^*\|^2 + \frac{\alpha_k^2 \sigma^2}{n}. \qquad (14)$$
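The variance-reduction effect of averaging $n$ noisy gradients is easy to check by simulation; the Gaussian noise model below (per-coordinate variance $\sigma^2$, so the squared-norm bound becomes $\sigma^2 d$ per agent) is an illustrative instance of Assumption 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, trials = 10, 3, 2.0, 20000

# each of the n noises has i.i.d. N(0, sigma^2) coordinates,
# so a single agent's noise satisfies E||xi_i||^2 = sigma^2 * d
noise = sigma * rng.normal(size=(trials, n, d))
avg = noise.mean(axis=1)                   # (1/n) * sum_i xi_i, per trial
msq = float(np.mean(np.sum(avg ** 2, axis=1)))
print(msq, sigma ** 2 * d / n)             # empirical vs. predicted sigma^2 d / n
```

The empirical mean squared norm of the averaged noise matches the $1/n$ reduction used in the recursion above.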
We measure the performance of SGD by $\mathbb{E}\|x(k) - x^*\|^2$, the expected squared distance between the solution at time $k$ and the optimal solution. Theorem 1 characterizes the convergence rate of $\mathbb{E}\|x(k) - x^*\|^2$, which is optimal for such stochastic gradient methods (see [27, 28]).
2.3 Distributed Stochastic Gradient Descent (DSGD)
We assume the same stepsize policy for DSGD as for SGD. To analyze DSGD starting from Eq. (9), define
$$\bar{x}(k) := \frac{1}{n} \sum_{i=1}^{n} x_i(k)$$
as the average of all the iterates in the network. Differently from the analysis for SGD, we will be concerned with two error terms. The first term $U(k)$, called the expected optimization error, is the expected squared distance between $\bar{x}(k)$ and $x^*$, and the second term $V(k)$, called the expected consensus error, measures the dissimilarity of the individual estimates across the agents. Given any individual iterate $x_i(k)$, its squared distance to the optimum is bounded by $2\,\mathbb{E}\|x_i(k) - \bar{x}(k)\|^2 + 2\,\mathbb{E}\|\bar{x}(k) - x^*\|^2$. Hence exploring the two terms will provide us with insights into the performance of DSGD. To simplify notation, denote
$$U(k) := \mathbb{E}\|\bar{x}(k) - x^*\|^2, \qquad V(k) := \sum_{i=1}^{n} \mathbb{E}\|x_i(k) - \bar{x}(k)\|^2.$$
Inspired by the analysis for SGD, we first look for an inequality that bounds $U(k+1)$ in terms of $U(k)$, analogous to (14) for SGD. One such relation, established in [15], takes the form
$$U(k+1) \leq (1 - \alpha_k \mu) \, U(k) + \frac{\alpha_k^2 \sigma^2}{n} + \frac{c_1 \alpha_k L}{n} \sqrt{U(k) V(k)} + \frac{c_2 \alpha_k L}{n} V(k), \qquad (17)$$
for appropriate constants $c_1, c_2 > 0$.
Comparing (17) to (14), we find two additional terms on the right-hand side of the inequality. Both terms involve the expected consensus error $V(k)$, thus reflecting the additional disturbance caused by the disagreement among the agents' solutions. Relation (17) also suggests that the convergence rate of $U(k)$ cannot be better than that of $\mathbb{E}\|x(k) - x^*\|^2$ for SGD, which is expected. Nevertheless, if $V(k)$ decays fast enough compared to $U(k)$, the two additional terms are negligible in the long run, and we would guess that the convergence rate of $U(k)$ is comparable to that of SGD.
This indeed turns out to be the case: it is shown in [15] that once $k$ is sufficiently large (past a transient time that depends on the network), $U(k)$ matches the corresponding error bound for SGD up to higher-order terms.
In other words, we have the network independence phenomenon: after a transient, DSGD performs comparably to a centralized stochastic gradient descent method with the same computational power (i.e., one which can query the same number of gradients per step as the entire network).
2.4 Numerical Illustration
We provide a numerical example to illustrate the asymptotic network independence property of DSGD. Consider the on-line ridge regression problem
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} \Big( \mathbb{E}_{(u_i, v_i)} \big[ (u_i^\top x - v_i)^2 \big] + \rho \|x\|^2 \Big), \qquad (18)$$
where $\rho > 0$ is a penalty parameter. Each agent $i$ collects data points in the form of $(u_i, v_i)$ continuously over time, with $u_i \in \mathbb{R}^d$ representing the features and $v_i \in \mathbb{R}$ being the observed outputs. Suppose each $u_i$ is uniformly distributed, and $v_i$ is drawn according to $v_i = u_i^\top \tilde{x}_i + \varepsilon_i$, where the $\tilde{x}_i$ are predefined parameters uniformly situated in a fixed interval, and the $\varepsilon_i$ are independent Gaussian random variables with mean $0$ and bounded variance. Given a pair $(u_i, v_i)$, agent $i$ can compute an estimated gradient of $f_i$, namely $g_i(x, u_i, v_i) = 2 (u_i^\top x - v_i) u_i + 2 \rho x$, which is unbiased. Problem (18) has a unique solution $x^*$ given by
$$x^* = \Big( \sum_{i=1}^{n} \mathbb{E}\big[u_i u_i^\top\big] + n \rho I \Big)^{-1} \sum_{i=1}^{n} \mathbb{E}\big[u_i u_i^\top\big] \, \tilde{x}_i.$$
In the experiments, we consider two instances. In the first instance, the agents communicate over a random (Erdős–Rényi) network for DSGD, where every two agents are linked with a fixed probability. In the second instance, the agents form a grid network. We use Metropolis weights in both instances. The initial vectors are set to $x_i(0) = \mathbf{0}$, the zero vector, for all $i$, and the stepsizes $\alpha_k$ are diminishing. For both SGD and DSGD, we run the simulations repeatedly and average the results to approximate the expected errors.
The performance of SGD and DSGD is shown in Figure 5. We notice that in both instances the expected consensus error for DSGD converges to $0$ faster than the expected optimization error, as predicted by our previous discussion. Regarding the expected optimization error, DSGD is slower than SGD during an initial transient phase, which is shorter for the random network than for the grid network; after that, their performance is almost indistinguishable. The difference in transient times is due to the stronger connectivity (smaller $\rho_w$) of the random network compared to the grid network.
In this paper, we provided a discussion of recent results which have overcome a key barrier in distributed optimization methods for machine learning. These results established an asymptotic network independence property: asymptotically, the distributed algorithm performs comparably to a centralized algorithm with the same computational power. We explained the property through examples of training ML models and provided a short mathematical analysis.
Along the path toward achieving asymptotic network independence in distributed optimization, there are various future research directions, including handling nonconvex objective functions, reducing communication costs and transient times, and using exact gradient information. We briefly discuss each of these next.
First, distributed training of deep neural networks – the state-of-the-art machine learning approach in many application areas – involves minimizing nonconvex objective functions, which differ from the main objectives considered in this paper. This area is largely unexplored, with a few recent works in [16, 3, 4].
In distributed algorithms, the costs associated with communication among the agents are often non-negligible and may become the main burden for large networks. It is therefore important to explore communication reduction techniques that do not sacrifice the asymptotic network independence property. The recent papers [4, 14] have touched upon this point.
When considering asymptotic network independence for distributed optimization, an important factor is the transient time to reach the asymptotic convergence rate, as it may take a long time before the distributed implementation catches up with the corresponding centralized method. In fact, as we have shown in Section 2.1, this transient time can be a function of the network topology and grows with the network size. Reducing the transient time is thus a key future objective.
Finally, while several recent works have established the asymptotic network independence property in distributed optimization, they are mainly constrained to using stochastic gradient information. If the exact gradient is available, will distributed methods still be able to compete with centralized ones? As we know, centralized algorithms typically enjoy a faster convergence speed with exact gradients; for example, plain gradient descent achieves linear convergence for strongly convex and smooth objective functions. With the exception of [29], which considered a restricted range of smoothness/strong convexity parameters, results on asymptotic network independence in this setting are currently lacking.
-  Gesualdo Scutari, Francisco Facchinei, Lorenzo Lampariello, Stefania Sardellitti, and Peiran Song. Parallel and distributed methods for constrained nonconvex optimization – Part II: Applications in communications and machine learning. IEEE Transactions on Signal Processing, 65(8):1945–1960, 2016.
-  Theodora S Brisimi, Ruidi Chen, Theofanie Mela, Alex Olshevsky, Ioannis Ch Paschalidis, and Wei Shi. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.
-  Kevin Scaman, Francis Bach, Sébastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2740–2749, 2018.
-  Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.
-  Bicheng Ying, Kun Yuan, and Ali H Sayed. Supervised learning under distributed features. IEEE Transactions on Signal Processing, 67(4):977–992, 2018.
-  Ruidi Chen and Ioannis Ch Paschalidis. A robust learning approach for regression models based on distributionally robust optimization. The Journal of Machine Learning Research, 19(1):517–564, 2018.
-  Angelia Nedic, Alex Olshevsky, Asuman Ozdaglar, and John N Tsitsiklis. On distributed averaging algorithms and quantization effects. IEEE Transactions on Automatic Control, 54(11):2506–2517, 2009.
-  Alex Olshevsky. Linear time average consensus and distributed optimization on fixed graphs. SIAM Journal on Control and Optimization, 55(6):3990–4014, 2017.
-  Angelia Nedić, Alex Olshevsky, and Michael Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
-  Richard Durrett. Random Graph Dynamics, volume 200. Cambridge University Press, Cambridge, 2007.
-  Alex Olshevsky, Ioannis Ch Paschalidis, and Artin Spiridonoff. Robust asynchronous stochastic gradient-push: asymptotically optimal and network-independent performance for strongly convex functions. arXiv preprint arXiv:1811.03982, 2018.
-  Shi Pu and Alfredo Garcia. A flocking-based approach for distributed stochastic optimization. Operations Research, 66(1):267–281, 2017.
-  Gemma Morral, Pascal Bianchi, and Gersende Fort. Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks. IEEE Transactions on Signal Processing, 65(11):2798–2813, 2017.
-  Anastasia Koloskova, Sebastian U Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv preprint arXiv:1902.00340, 2019.
-  Alex Olshevsky, Ioannis Ch Paschalidis, and Shi Pu. A non-asymptotic analysis of network independence for distributed stochastic gradient descent. arXiv preprint arXiv:1906.02702, 2019.
-  Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5336–5346, 2017.
-  Zaid J Towfic and Ali H Sayed. Adaptive penalty-based distributed stochastic convex optimization. IEEE Transactions on Signal Processing, 62(15):3924–3938, 2014.
-  Nikolaos Chatzipanagiotis and Michael M Zavlanos. A distributed algorithm for convex constrained optimization under noise. IEEE Transactions on Automatic Control, 61(9):2496–2511, 2016.
-  Angelia Nedić and Alex Olshevsky. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947, 2016.
-  Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, pages 1–48, 2017.
-  Shi Pu and Alfredo Garcia. Swarming for faster convergence in stochastic optimization. SIAM Journal on Control and Optimization, 56(4):2997–3020, 2018.
-  Muhammed O Sayin, N Denizcan Vanli, Suleyman S Kozat, and Tamer Başar. Stochastic subgradient algorithms for strongly convex optimization over distributed networks. IEEE Transactions on Network Science and Engineering, 4(4):248–260, 2017.
-  Benjamin Sirb and Xiaojing Ye. Decentralized consensus algorithm with delayed and stochastic gradients. SIAM Journal on Optimization, 28(2):1232–1254, 2018.
-  Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. arXiv preprint arXiv:1805.11454, 2018.
-  Sulaiman A Alghunaim and Ali H Sayed. Distributed coupled multi-agent stochastic optimization. IEEE Transactions on Automatic Control, 2019.
-  Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 2017.
-  Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
-  Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578. Omnipress, 2012.
-  Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. arXiv preprint arXiv:1704.07807, 2017.