Optimal Statistical Rates for Decentralised Non-Parametric Regression with Linear Speed-Up

05/08/2019 ∙ by Dominic Richards, et al. ∙ University of Oxford

We analyse the learning performance of Distributed Gradient Descent in the context of multi-agent decentralised non-parametric regression with the square loss function when i.i.d. samples are assigned to agents. We show that if agents hold sufficiently many samples with respect to the network size, then Distributed Gradient Descent achieves optimal statistical rates with a number of iterations that scales, up to a threshold, with the inverse of the spectral gap of the gossip matrix divided by the number of samples owned by each agent raised to a problem-dependent power. The presence of the threshold comes from statistics. It encodes the existence of a "big data" regime where the number of required iterations does not depend on the network topology. In this regime, Distributed Gradient Descent achieves optimal statistical rates with the same order of iterations as gradient descent run with all the samples in the network. Provided the communication delay is sufficiently small, the distributed protocol yields a linear speed-up in runtime compared to the single-machine protocol. This is in contrast to decentralised optimisation algorithms that do not exploit statistics and only yield a linear speed-up in graphs where the spectral gap is bounded away from zero. Our results exploit the statistical concentration of quantities held by agents and shed new light on the interplay between statistics and communication in decentralised methods. Bounds are given in the standard non-parametric setting with source/capacity assumptions.


1 Introduction

In machine learning a canonical goal is to use training data sampled independently from an unknown distribution to fit a model that performs well on unseen data from the same distribution. With a loss function measuring the performance of a model on a data point, a common approach is to find a model that minimises the average loss on the training data with some form of explicit regularisation to control model complexity and avoid overfitting. Due to the increasingly large size of datasets and high model complexity, direct minimisation of the regularised problem is posing more and more computational challenges. This has led to growing interest in approaches that improve models incrementally using gradient descent methods [8], where model complexity is controlled through forms of implicit/algorithmic regularisation such as early stopping and step-size tuning [54, 55, 26].

The growth in the size of modern datasets has also meant that the coordination of multiple machines is often required to fit machine learning models. In the centralised server-clients setup, a single machine (server) is responsible for aggregating and disseminating information to other machines (clients) in what is effectively a star topology. In some settings, such as ad-hoc wireless and peer-to-peer networks, network instability, bandwidth limitation and privacy concerns make centralised approaches less feasible. This has motivated research into scalable methods that can avoid the bottleneck and vulnerability introduced by the presence of a central authority. Such solutions are called “decentralised”, as no single entity is responsible for the collection and dissemination of information: machines communicate with neighbours in a network structure that encodes communication channels.

From the early works [49, 50] to more recent work [21, 33, 32, 22, 28, 29, 10, 17, 46, 30], problems in decentralised multi-agent optimisation have often been treated as a particular instance of consensus optimisation. In this framework, a network of machines or agents collaborates to minimise the average of functions held by individual agents, hence “reaching consensus” on the solution of the global problem. In this setting the performance of the chosen protocol naturally depends on the network topology, since to solve the problem each agent has to send information to and receive information from all other agents. In particular, the number of iterations required by decentralised iterative gradient methods typically scales with the inverse of the spectral gap of the communication matrix (a.k.a. gossip or consensus matrix) [17, 41, 42], which reflects the performance of gossip protocols in the problem of distributed averaging [9, 16, 43, 4].

Many distributed machine learning problems, in particular those involving empirical risk minimisation, have been framed in the context of consensus optimisation. However, as highlighted in [45] and more recently in [37], often these problems have more structure than consensus optimisation due to the statistical regularity of the data. When the agents’ functions are the empirical risk of their local data, in the setting where the local data comes from the same unknown distribution (homogeneous setting), the functions held by each agent are similar to one another by the phenomenon of statistical concentration. In particular, in the limit of an infinite amount of data per agent, the local functions are the same and agents do not need to communicate to solve the problem. This phenomenon highlights the existence of a natural trade-off between statistics and communication. While statistical similarities of local objective functions and the statistics/communication trade-off have been investigated and exploited in the centralised server-clients setup, typically in the analysis and design of divide-and-conquer schemes [57, 27, 19, 31, 25, 1, 59, 45, 44, 58, 2], only recently has there been some investigation into the interplay between statistics and communication/network-topology in the decentralised setting. The authors in [6] investigate the interplay between the spectral norm of the data-generating distribution and the inverse spectral gap of the communication matrix for Distributed Stochastic Gradient Descent in the case of strongly convex losses. Like most of the literature on decentralised machine learning, this work also focuses on minimising the training error and not the test/prediction error (numerical experiments are given for the test error). Some works have investigated the performance on the test loss in the single-pass/online stochastic setting where agents use each data point only once. The authors in [36] investigate a distributed regularised online learning setting [52] and obtain guarantees for a “multi-step” Distributed Stochastic Mirror Descent algorithm where agents reach consensus on their stochastic gradients in-between computation steps. The works [24] and [3] consider the performance of Distributed Stochastic Gradient Descent algorithms in the non-convex smooth case. They investigate the average performance of the agents over the network in terms of convergence to a stationary point of the test loss [18] and show that a linear speed-up in computational time can be achieved provided the number of samples seen, equivalently the number of iterations performed, exceeds the network size times the inverse of the spectral gap, each raised to a certain power. The work [37] seems to be the first to have considered minimisation of the test error in the multi-pass/offline stochastic setting that more naturally relates to the classical literature on consensus optimisation. The authors investigate stability of Distributed Stochastic Gradient Descent on the test error and show that for smooth and convex losses the number of iterations required to achieve optimal statistical rates scales with the inverse of the spectral gap of the gossip matrix, a term that captures the noise of the gradients’ estimates, and a term that controls the statistical proximity of the local empirical losses.

1.1 Contributions

In this work we investigate the implicit-regularisation learning performance of full-batch Distributed Gradient Descent [32] on the test error in the context of non-parametric regression with the square loss function. In the homogeneous setting where agents hold independent and identically distributed data points, we investigate the choice of step size and number of iterations that guarantee each agent to achieve optimal statistical rates with respect to all the samples in the network. We build a theoretical framework that allows us to directly and explicitly exploit the statistical concentration of quantities (i.e. batched gradients) held by agents. On the one hand, exploiting concentration yields savings on computation, i.e. it gives faster convergence rates compared to methods that do not exploit concentration in their parameter tuning. On the other hand, it yields savings on communication, as it allows us to take advantage of the trade-off between statistical power and communication costs. Firstly, we show that if agents hold sufficiently many samples with respect to the network size, then Distributed Gradient Descent achieves optimal statistical rates up to poly-logarithmic factors with a number of iterations that scales with the inverse of the spectral gap of the communication matrix divided by the number of samples owned by each agent raised to a problem-dependent power, up to a statistics-induced threshold. Previous results for decentralised iterative gradient schemes in the context of consensus optimisation do not take advantage of the statistical nature of decentralised empirical risk minimisation problems. In the statistical setting that we consider, these methods would require a larger number of iterations that scales only with respect to the inverse of the spectral gap.
Secondly, we show that if agents additionally hold sufficiently many samples with respect to the inverse of the spectral gap, then the same order of iterations allows Distributed Gradient Descent and Single-Machine Gradient Descent (i.e. gradient descent run on a single machine that holds all the samples in the network) to achieve optimal statistical rates up to poly-logarithmic factors. Provided the communication delay is sufficiently small, this yields a linear speed-up in runtime over Single-Machine Gradient Descent, with a “single-step” method that performs a single communication round per local gradient descent step. Single-step methods that do not exploit concentration can only achieve a linear speed-up in runtime in graphs with spectral gap bounded away from zero, i.e. expanders or the complete graph. Our results demonstrate how the increased statistical similarity between the local empirical risk functions can make up for a decreased connectivity in the graph topology, showing that a linear speed-up in runtime can be achieved in any graph topology by exploiting concentration. To the best of our knowledge, this is the first work to isolate this type of phenomenon.

We prove our results under the standard “source” and “capacity” assumptions in non-parametric regression. These assumptions relate, respectively, to the projection of the optimal predictor on the hypothesis space and to the effective dimension of this space [56, 12]. A contribution of this work is to show that proper tuning yields, up to poly-logarithmic terms, optimal non-parametric rates in decentralised learning. As far as we are aware, in the distributed setting such guarantees have been established only for centralised divide-and-conquer methods [57, 27, 19, 31, 25].

To prove our results we build upon previous work for Single-Machine Gradient Descent applied to non-parametric regression, in particular the line of works [54, 39, 26]. Exploiting that in our setting the iterates of Distributed Gradient Descent can be written in terms of products of linear operators depending on the data held by agents, we decompose the excess risk into bias and sample variance terms for Single-Machine Gradient Descent plus an additional quantity that captures the error incurred by using a decentralised protocol over the communication network. We analyse this network error term by further decomposing it into a term that behaves similarly to the consensus error previously considered in [17, 32], and a new higher-order term. We control both terms by using the structure of the gradient updates, which allows us to analyse the interplay between statistics, via concentration, and network topology, via mixing of random walks related to the gossip matrix.

The work is structured as follows. Section 2 presents the setting, assumptions, and algorithm that we consider. Section 3 states the main convergence result and discusses implications from the point of view of statistics, computation and communication. Section 4 presents the error decomposition into bias, variance, and network error, and it illustrates the implicit regularisation strategy that we adopt. Section 5 highlights some of the features of our contribution in the light of future research directions. The appendix is structured as follows. Section A includes some remarks about our results. Section B illustrates the main scheme of the proofs, highlighting the interplay between statistics and network topology. Section C contains the full details of the proofs.

2 Setup

In this section we describe the learning problem, assumptions and algorithm that we consider.

2.1 Learning problem: decentralised non-parametric least-squares regression

We adopt the setting used in [39, 26], which involves regression in abstract Hilbert spaces. This setting is of relevance for applications related to the Reproducing Kernel Hilbert Space (RKHS). See the work in [54] and references therein.

Let be a separable Hilbert space with inner product and induced norm denoted by and , respectively. Let be the input space and be the output space. Let be an unknown probability measure on , be the marginal on , and be the conditional distribution on given . Assume that there exists a constant so that

(1)

Let the network of agents be modelled by a simple, connected, undirected, finite graph , with nodes joined by edges . Edges represent communication constraints: agents can only communicate if they share an edge . We consider the homogeneous setting where each agent is given data points sampled independently from , where and , and each pair is sampled from . The problem under study is the minimisation of the test/prediction risk with the square loss:

(2)

The quality of an approximate solution is measured by the excess risk

Notation

Given a matrix , let denote the element and denote the row. Let denote orders of magnitudes up to constants in and , and denote orders of magnitudes up to both constants and poly-logarithmic terms in and . Let denote inequalities and equalities modulo constants and poly-logarithmic terms in . We use the notation and .

2.2 Assumptions

The assumptions that we consider are standard in non-parametric regression [26, 34]. The first assumption is a control on the even moments of the response.

Assumption 1

There exist and such that we have for any .

Let be the Hilbert space of square-integrable functions from to with respect to , with norm . Let be the operator defined as . Under Assumption 1 the operator can be proved to be in the class of positive trace operators [14], and therefore the -th power , with , can be defined by using spectral theory. Let us also define the operator as and its operator norm . The function minimising the expected squared loss (2) over all measurable functions is known to be the conditional expectation for . Let be the hypothesis space that we consider. The optimal may not be in as under Assumption 1 the space of functions searched is a subspace of . Let denote the projection of onto the closure of in . Searching for a solution to (2) is equivalent to searching for a linear function in that approximates .
The following assumption quantifies how well the target function can be approximated in .

Assumption 2

There exist and such that .

This assumption is often called the “source” condition [12]. Representing in the eigenspace of , this condition can be related to the rate at which the coefficients of this representation decay. The bigger is, the faster the decay, and the more stringent the assumption is. In particular, if then the target function is in the hypothesis space . The last assumption is on the capacity of the hypothesis space.

Assumption 3

There exist such that for all .

Assumption 3 relates to the effective dimension of the underlying regression problem [56, 12] and is often called the “capacity” assumption. This assumption is always satisfied for and since is a trace class operator. This case is called the capacity-independent setting. Meanwhile, this assumption is satisfied for if, for instance, the eigenvalues of , denoted by , decay sufficiently quickly, i.e. . This case allows improved rates to be obtained. For more details on the interpretation of these assumptions we refer to the work in [39, 26, 34].
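The capacity condition can be checked numerically on a toy spectrum. The sketch below assumes the form of the condition commonly used in the cited literature, namely that the effective dimension, the trace of L(L + λI)^{-1}, is at most a constant times λ^{-γ}. The symbols `gamma` and `lam`, the truncated polynomial spectrum i^{-1/γ}, and the function name `effective_dimension` are illustrative choices and are not taken from the text above.

```python
import numpy as np


def effective_dimension(eigenvalues, lam):
    """Return sum_i s_i / (s_i + lam), i.e. Tr(L (L + lam I)^{-1})
    for an operator L with the given (non-negative) eigenvalues."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return float(np.sum(eigenvalues / (eigenvalues + lam)))


# Assumed spectrum (illustrative only): polynomial decay i^(-1/gamma),
# truncated to finitely many eigenvalues so that it can be summed.
gamma = 0.5
indices = np.arange(1, 100_001, dtype=float)
eigenvalues = indices ** (-1.0 / gamma)

# If the capacity condition holds with exponent gamma, the product
# effective_dimension(lam) * lam**gamma should stay bounded as lam shrinks.
lams = [1e-1, 1e-2, 1e-3, 1e-4]
ratios = [effective_dimension(eigenvalues, lam) * lam ** gamma for lam in lams]

for lam, ratio in zip(lams, ratios):
    print(f"lam = {lam:.0e}   N(lam) * lam^gamma = {ratio:.3f}")
```

For this assumed spectrum the printed products stay within a narrow band instead of growing as `lam` decreases, which is the behaviour the capacity assumption asks for.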

2.3 Algorithm: distributed gradient descent

We now describe the Distributed Gradient Descent algorithm [32] and its application to the problem of non-parametric regression. Let be a symmetric doubly-stochastic matrix, i.e. and where is the vector of all ones. Let be supported on the graph, i.e. for any , only if . The matrix encodes local averaging on the network: when each agent has a real number represented by the vector , the vector for encodes what each agent computes after taking a weighted average of its own and neighbours’ numbers. Distributed Gradient Descent is implemented by communication on the network through the gossip matrix . Initialised at for , the iterates of the Distributed Gradient Descent are defined as follows, for and :

(3)

where is the sequence of positive step sizes. The iterates (3) can be seen as a combination of two steps: first, each agent performs a local gradient descent step ; second, each agent performs local averaging through the consensus step . We treat gradient descent as a statistical device. We are interested in tuning the parameters of the algorithm to bound the expected value of the excess risk where denotes expectation with respect to the data .
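The two-step structure described above (a local gradient step followed by a consensus step through the gossip matrix) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: inputs live in a finite-dimensional space R^d, the step size is constant, all agents start at zero, the gradient of the square loss is taken up to a constant factor absorbed into the step size, and the gossip matrix is a lazy averaging matrix on a 4-cycle. The names `distributed_gradient_descent`, `W`, `X`, `y` and `w_true` are hypothetical.

```python
import numpy as np


def distributed_gradient_descent(X, y, W, step_size, num_iters):
    """Sketch of Distributed Gradient Descent for least squares.

    X: (n, m, d) array, m inputs in R^d for each of the n agents.
    y: (n, m) array of responses.
    W: (n, n) symmetric doubly-stochastic gossip matrix.
    Returns an (n, d) array holding every agent's final iterate.
    """
    n, m, d = X.shape
    omega = np.zeros((n, d))  # assumed initialisation: all agents at zero
    for _ in range(num_iters):
        # Local residuals <omega_v, x_{i,v}> - y_{i,v}, shape (n, m).
        residuals = np.einsum("vmd,vd->vm", X, omega) - y
        # Local full-batch gradients (constant factor absorbed), shape (n, d).
        grads = np.einsum("vm,vmd->vd", residuals, X) / m
        # Step 1: each agent takes a local gradient descent step.
        local_step = omega - step_size * grads
        # Step 2: each agent averages with its neighbours via W.
        omega = W @ local_step
    return omega


# Toy usage: 4 agents on a cycle, noise-free linear responses.
rng = np.random.default_rng(0)
n, m, d = 4, 50, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, m, d))
y = X @ w_true

# Lazy averaging on the cycle: weight 1/2 on self, 1/4 on each neighbour.
W = 0.5 * np.eye(n)
for v in range(n):
    W[v, (v - 1) % n] += 0.25
    W[v, (v + 1) % n] += 0.25

omega_final = distributed_gradient_descent(X, y, W, step_size=0.1, num_iters=1000)
print(np.abs(omega_final - w_true).max())
```

In this noise-free toy problem every agent's local minimiser coincides with `w_true`, so all rows of `omega_final` approach the same vector. With noisy data the number of iterations acts as a regularisation parameter and would be tuned rather than taken large.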

Network dependence

Let be the second largest eigenvalue in magnitude of the communication matrix . Specifically, given the spectral decomposition of the gossip matrix where are the ordered real eigenvalues of and the associated eigenvectors, we have . In many settings, the spectral gap scales with the size of the network raised to a certain power depending on the topology. For instance, supposing is a finite regular graph and the communication matrix is the random walk matrix, then the inverse of the spectral gap scales as for a complete graph, for a grid, and for a cycle [13, 23, 17]. The question of designing gossip matrices that yield better (smaller) scaling for the quantity has been investigated [53], and it has been found numerically that the rates mentioned above cannot be improved unless lifted graphs are considered [43].
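The dependence of the spectral gap on the topology can be explored numerically. The sketch below is an illustration under assumptions: it uses the lazy random walk matrix (half the identity plus half the random walk matrix) so that all eigenvalues are non-negative, which avoids the eigenvalue at -1 that the plain random walk has on bipartite graphs such as even cycles. The helper names are hypothetical, and `np.linalg.eigvalsh` is used because the matrix is symmetric for regular graphs.

```python
import numpy as np


def lazy_random_walk(adjacency):
    """Lazy random walk matrix (I + D^{-1} A) / 2 of a graph.

    For a regular graph this matrix is symmetric and doubly stochastic.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1)
    return 0.5 * (np.eye(len(adjacency)) + adjacency / degrees[:, None])


def spectral_gap(W):
    """One minus the second largest eigenvalue in magnitude of symmetric W."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - magnitudes[1]


def cycle_adjacency(n):
    """Adjacency matrix of the cycle on n >= 3 nodes."""
    A = np.zeros((n, n))
    for v in range(n):
        A[v, (v + 1) % n] = 1.0
        A[(v + 1) % n, v] = 1.0
    return A


def complete_adjacency(n):
    """Adjacency matrix of the complete graph on n nodes (no self-loops)."""
    return np.ones((n, n)) - np.eye(n)


for n in (10, 20, 40):
    gap_cycle = spectral_gap(lazy_random_walk(cycle_adjacency(n)))
    gap_complete = spectral_gap(lazy_random_walk(complete_adjacency(n)))
    print(f"n = {n:3d}   1/gap cycle = {1 / gap_cycle:8.1f}   "
          f"1/gap complete = {1 / gap_complete:5.2f}")
```

Doubling the number of nodes roughly quadruples the inverse gap of the cycle, while the inverse gap of the complete graph stays bounded, which is the kind of topology dependence discussed above.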

3 Main result: optimal statistical rates with linear speed-up in runtime

We now state and highlight the main contribution of this work in the context of decentralised statistical optimisation. The result that we are about to state in Theorem 3 showcases the interplay between statistics and communication that arises from the statistical regularities of the problem. This result shows the existence of a “big data” regime where Distributed Gradient Descent can achieve a linear (in the number of agents ) speed-up in runtime compared to Single-Machine Gradient Descent.

Let Assumptions 1, 2, 3 hold with and . Let be the smallest integer greater than the quantity

Let . If and , then :

where depends on , and polynomials of and .

Theorem 3 shows that when agents are given sufficiently many samples () with respect to the number of agents (), , proper tuning of the step size and number of iterations (a form of implicit regularisation) allows Distributed Gradient Descent to recover the optimal statistical rate for [12] up to poly-logarithmic terms.

Single-Machine Gradient Descent run on all of the observations has been previously shown to reach optimal statistical accuracy with a number of iterations [26]. The number of iterations prescribed by Theorem 3 scales like times a network-dependent factor that is a function of the inverse of the spectral gap . The fact that the number of iterations required to reach a prescribed level of error accuracy is inversely proportional to the spectral gap is a standard feature of iterative gradient methods applied to generic decentralised consensus optimisation problems [17, 41, 42]. This dependence encodes the fact that in the case of generic objective functions assigned to agents, agents have to share information with everyone to solve the global problem and minimise the sum of the local functions; hence, more iterations are required in graph topologies that are less well-connected. In the present homogeneous setting, however, the statistical nature of the problem allows us to exploit concentration of random variables to characterise the existence of a (network-dependent) “big data” regime where the number of iterations does not depend on the network topology. The trade-off between statistics and communication is encoded by the dependence of the tuning parameters (stopping time and step size) on the number of samples assigned to each agent. Observe that the factor is a decreasing function of , up to the threshold . When this factor becomes and Theorem 3 guarantees that the same order of iterations allows both Distributed and Single-Machine Gradient Descent to achieve the optimal statistical rates up to poly-logarithmic factors. This regime represents the case when the increased statistical similarity between the local empirical risk functions assigned to each agent (increasing as a function of , as described by the non-asymptotic Law of Large Numbers) makes up for the decreased connectivity in the graph topology (typically decreasing with the spectral gap ) to yield a linear speed-up in runtime over Single-Machine Gradient Descent when the communication delay between agents is sufficiently small. See Section 3.1 below.

The result of Theorem 3 depends on some other requirements which we now briefly discuss. The requirement is technical and arises from the need to perform sufficiently many iterations to reach the mixing time of the gossip matrix , i.e. . Note that the number of iterations depends on the number of agents, the number of samples and the spectral gap. The requirement relates to the difficulty of the estimation problem and is stronger than a similar condition seen for single-machine gradient methods where , see for instance the works [26, 34]. This requirement, alongside , ensures that the higher-order error terms arising from considering a decentralised protocol decay sufficiently quickly with respect to the number of samples owned by agents . The condition can be removed if the covariance operator is assumed to be known to agents, which aligns with the additive noise oracle in single-pass Stochastic Gradient Descent [15] or fixed-design regression in finite-dimensional settings [20]. The condition corresponds to the case when the rate of concentration of the batched gradients held by agents (i.e. ) is faster than the optimal statistical rate, i.e. . This condition becomes more stringent (i.e. more data per agent is needed) as the problem becomes easier from a statistical point of view and and increase (see discussion in Section 2.2). This is due to the fact that as and increase, only the statistical rate improves while the rate of concentration in the network error stays the same, implying that more data is needed to balance the two terms.

3.1 Linear speed-up in runtime

Let gradient computations cost unit of time and communication delay between agents be units of time. Denote the number of iterations required by Single-Machine Gradient Descent and Distributed Gradient Descent to achieve the optimal statistical rate by and , respectively. The speed-up in computational time obtained by running the distributed protocol over the single-machine protocol is of the order , where is the maximum degree of the communication matrix . Theorem 3 implies that when then , and if grows as then the speed-up in computational time is of order , linear in the number of agents. Classical “single-step” decentralised methods that alternate single communication rounds per local gradient computation, such as the methods inspired by [32], do not exploit concentration and have a runtime that scales with the inverse of the spectral gap, without any threshold. As a result, these methods only yield a linear speed-up in graphs with spectral gap bounded away from zero, i.e. expanders or the complete graph. See below for more details. On the other hand, “multi-step” methods that alternate multiple communication rounds per local gradient computation, such as the ones considered in [36, 41, 42], display a runtime that scales with a factor of the form in our setting. Thus, while these methods can achieve a linear speed-up in any graph topology in the “big data” regime without exploiting concentration, they require an additional amount of communication rounds that is network-dependent and scales with the inverse of the spectral gap. For a cycle graph, for instance, this means an extra communication steps per iteration (or for gossip-accelerated methods). Hence, classical decentralised optimisation methods that do not exploit concentration suffer from a trade-off between runtime and communication cost: reducing the first increases the second, and vice versa.
Our results show that single-step methods can achieve a linear speed-up in runtime in any graph topology by exploiting concentration: statistics allows us to find a regime where it is possible to simultaneously have a linear speed-up in runtime without increasing communication.
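The runtime comparison can be made concrete with a small cost model. The sketch below is an assumption made for illustration and is not a formula taken from the text: one gradient computation costs one unit of time, a single-machine iteration costs one unit per sample in the network, and a distributed iteration costs one unit per local sample plus a communication delay `tau` plus the maximum degree of the graph. The function name `speed_up` and all parameter names are hypothetical.

```python
def speed_up(t_single, t_dist, n, m, tau, max_degree):
    """Ratio of single-machine runtime to distributed runtime.

    Assumed cost model (illustrative):
      single-machine runtime = t_single * (n * m)
      distributed runtime    = t_dist * (m + tau + max_degree)
    where n is the number of agents and m the number of samples per agent.
    """
    single_machine_time = t_single * n * m
    distributed_time = t_dist * (m + tau + max_degree)
    return single_machine_time / distributed_time


# Same order of iterations for both protocols (the "big data" regime),
# 1000 samples per agent, a cycle graph (maximum degree 2).
for n in (16, 32, 64):
    no_delay = speed_up(100, 100, n=n, m=1000, tau=0, max_degree=2)
    delay_like_m = speed_up(100, 100, n=n, m=1000, tau=1000, max_degree=2)
    print(f"n = {n:3d}   speed-up (tau = 0) = {no_delay:6.2f}   "
          f"speed-up (tau = m) = {delay_like_m:6.2f}")
```

Under this model, when both protocols need the same order of iterations and the delay is at most of the order of the local sample size, the ratio grows proportionally to the number of agents, matching the linear speed-up described above.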

Comparison to single-step decentralised methods that do not exploit concentration

Decentralised optimisation methods that do not consider statistical concentration rates in their parameter tuning cannot exploit the statistics/communication trade-off encoded by the presence of the factor in Theorem 3, and they typically require a smaller step size and more iterations to achieve optimal statistical rates. The convergence rate typically achieved by classical consensus optimisation methods, e.g. [17], is recovered in Theorem 3 when as in this case the number of iterations required becomes , which corresponds to scaled by a certain power of (in our setting the power is ). This represents the setting where the choice of step size aligns with the choice in the single-machine case scaled by , and a linear speed-up occurs when . Since the network error is decreasing in in our case (due to concentration), larger step sizes can be chosen for . Specifically, the single-machine step size is now scaled by , yielding a linear speed-up when , which, as increases, is a weaker requirement on the network topology than in the standard consensus optimisation setting.

4 General result: error decomposition and implicit regularisation

Theorem 3 is a corollary of the next result, which explicitly highlights the interplay between statistics and network topology and the implicit regularisation role of the step size and number of iterations. Let Assumptions 1, 2, 3 hold with . Let with and . If , then for all , and :

(4)
(5)
(6)

where are all constants depending on .

The bound in Theorem 4 shows that the excess risk has been decomposed into three main terms, as detailed in Section B.1. The first term (4) corresponds to the error achieved by Single-Machine Gradient Descent run on all samples. It consists of both bias and sample variance terms [26]. The other two terms (5) and (6) characterise the network error due to the use of a decentralised protocol. These terms decrease with the number of samples owned by each agent. This captures the fact that, as agents are given samples from the same unknown distribution, agents are in fact solving the same learning problem and their local empirical loss functions concentrate to the same objective as increases. The decentralised error term is itself composed of two terms which decay at different rates with respect to . The term in (5) is dominant and decays at the order of . This can be interpreted as the consensus error seen in the works [32, 17] for instance. As in that setting, this quantity is also increasing with the step size and decreasing with the spectral gap of the communication matrix , as encoded by . The term (6) decays at the faster rate of . This is a higher-order error term that does not appear in the error decomposition when the covariance operator is assumed to be known to agents. This quantity arises from the interaction between the local averaging on the network through and what has been previously labelled as the “multiplicative” noise in the single-machine single-pass stochastic gradient setting for least squares [15], i.e. the empirical covariance operator interacting with the iterates at each step. Section B.2 provides a high-level illustration of the analysis of the Network Error terms (5) and (6).

The bound in Theorem 4 shows how the algorithmic parameters, step size and number of iterations, act as regularisation parameters for Distributed Gradient Descent, following what is seen in the single-machine setting. Theorem 3 demonstrates how optimal statistical rates can be recovered by tuning these parameters appropriately with respect to the network topology, network size, number of samples, and with respect to the estimation problem itself. The bound in Theorem 3 is obtained from the bound in Theorem 4 by first tuning the quantity to the order so that the bias and variance terms in (4) achieve the optimal statistical rate. This leaves the tuning of the remaining degree of freedom (say ) to ensure that also the network error achieves the optimal statistical rate. The high-level idea is the following. As increases, the network error is dominated by the term in (5) that is proportional to the factor . There are two ways to choose the largest possible step size to guarantee that this factor is , depending on whether the rate of concentration of the batched gradients held by agents is faster than the optimal statistical rate or not, i.e., whether is true or not (cf. Section 3). The two cases yield the factors and in Theorem 3, corresponding to the choice and , respectively. If the concentration of the batched gradients held by agents fully compensates for the network error, i.e. , then with a constant step size and , yielding the regime where a linear speed-up occurs. For more details on the parameters , see Lemma C.3.1 in Appendix C.3.1.

5 Future directions

We highlight some of the features of our contribution and outline directions for future research.

Non-parametric setting

We prove bounds in the attainable case . The non-attainable case is known to be more challenging [26], and it is natural to investigate to what extent our results can be extended to that setting. We consider the case which does not include the finite-dimensional setting , , where the optimal rate is [51]. While adapting our results to this setting requires minor modifications, optimal bounds would only hold for “easy” estimation problems with due to the higher-order term in the network error. Improvements require getting better bounds on this term, potentially using a different learning rate.

General loss functions

The analysis that we develop is specific to the square loss, which yields the bias/variance error decomposition and allows us to obtain explicit characterisations by expanding the squares. While the concentration phenomena that we exploit are generic, different techniques are required to extend our analysis to other losses, as in the single-machine setting. The statistical proximity of agents’ functions in the finite-dimensional setting has been investigated in [37].

Statistics/communication trade-off with sparse/randomised gossip

In this work we show that when agents hold sufficiently many samples, Distributed and Single-Machine Gradient Descent achieve the optimal statistical rate with the same order of iterations. This motivates balancing and trading off communication and statistics, e.g., investigating statistically robust procedures in settings where agents communicate with a subset of neighbours, either deterministically or randomly [9, 16, 4].

Stochastic gradient descent and mini-batches

Our work exploits concentration of gradients around their means, so full-batch gradients (i.e. batches of size ) yield the concentration rate . In single-machine learning, stochastic gradient descent [38] has been shown to achieve good statistical performance in a variety of settings while allowing for computational savings. Extending our findings to stochastic methods with appropriate mini-batch sizes is another avenue for future investigation.

Acknowledgments

We would like to thank Francis Bach, Lorenzo Rosasco and Alessandro Rudi for helpful discussions.

References

  • [1] Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  • [2] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in neural information processing systems, pages 1756–1764, 2015.
  • [3] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792, 2018.
  • [4] Florence Bénézit, Alexandros G Dimakis, Patrick Thiran, and Martin Vetterli. Order-optimal consensus through randomized path averaging. IEEE Transactions on Information Theory, 56(10):5150–5167, 2010.
  • [5] Raphaël Berthier, Francis Bach, and Pierre Gaillard. Accelerated Gossip in Networks of Given Dimension using Jacobi Polynomial Iterations. arXiv preprint arXiv:1805.08531, May 2018.
  • [6] Avleen S Bijral, Anand D Sarwate, and Nathan Srebro. Data-dependent convergence for consensus stochastic optimization. IEEE Transactions on Automatic Control, 62(9):4483–4498, 2017.
  • [7] Gilles Blanchard and Nicole Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, 2018.
  • [8] Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
  • [9] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508–2530, 2006.
  • [10] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, January 2011.
  • [11] Ming Cao, Daniel A Spielman, and Edmund M Yeh. Accelerated gossip algorithms for distributed computation. In Proc. of the 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959. Citeseer, 2006.
  • [12] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • [13] Fan R.K. Chung. Spectral graph theory. American Mathematical Soc., 1997.
  • [14] Felipe Cucker and Ding Xuan Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
  • [15] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.
  • [16] Alexandros DG Dimakis, Anand D Sarwate, and Martin J Wainwright. Geographic gossip: Efficient averaging for sensor networks. IEEE Transactions on Signal Processing, 56(3):1205–1216, 2008.
  • [17] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
  • [18] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • [19] Zheng-Chu Guo, Shao-Bo Lin, and Ding-Xuan Zhou. Learning theory of distributed spectral algorithms. Inverse Problems, 33(7):074009, 2017.
  • [20] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
  • [21] Björn Johansson, Maben Rabi, and Mikael Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In Decision and Control, 2007 46th IEEE Conference on, pages 4705–4710. IEEE, 2007.
  • [22] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.
  • [23] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • [24] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  • [25] Junhong Lin and Volkan Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral-regularization algorithms. arXiv preprint arXiv:1801.07226, 2018.
  • [26] Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.
  • [27] Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. The Journal of Machine Learning Research, 18(1):3202–3232, 2017.
  • [28] Ilan Lobel and Asuman Ozdaglar. Distributed subgradient methods for convex optimization over random networks. IEEE Transactions on Automatic Control, 56(6):1291–1306, 2011.
  • [29] Ion Matei and John S Baras. Performance evaluation of the consensus-based distributed subgradient method under random communication topologies. IEEE Journal of Selected Topics in Signal Processing, 5(4):754–771, 2011.
  • [30] Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(61):1–35, 2016.
  • [31] Nicole Mücke and Gilles Blanchard. Parallelizing spectrally regularized kernel algorithms. The Journal of Machine Learning Research, 19(1):1069–1097, 2018.
  • [32] Angelia Nedić, Alex Olshevsky, Asuman Ozdaglar, and John N. Tsitsiklis. On distributed averaging algorithms and quantization effects. IEEE Transactions on Automatic Control, 54(11):2506–2517, 2009.
  • [33] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • [34] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, pages 8125–8135. 2018.
  • [35] IF Pinelis and AI Sakhanenko. Remarks on inequalities for large deviation probabilities. Theory of Probability & Its Applications, 30(1):143–148, 1986.
  • [36] M. Rabbat. Multi-agent mirror descent for decentralized stochastic optimization. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 517–520, Dec 2015.
  • [37] D. Richards and P. Rebeschini. Graph-Dependent Implicit Regularisation for Distributed Stochastic Subgradient Descent. ArXiv e-prints, sep 2018.
  • [38] Herbert Robbins and Sutton Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.
  • [39] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.
  • [40] Ali H. Sayed. Adaptive networks. Proceedings of the IEEE, 102(4):460–497, 2014.
  • [41] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3027–3036. JMLR. org, 2017.
  • [42] Kevin Scaman, Francis Bach, Sébastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2745–2754, 2018.
  • [43] Devavrat Shah. Gossip algorithms. Foundations and Trends® in Networking, 3(1):1–125, 2009.
  • [44] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Advances in Neural Information Processing Systems, pages 163–171, 2014.
  • [45] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, pages 850–857. IEEE, 2014.
  • [46] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  • [47] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
  • [48] Pierre Tarres and Yuan Yao. Online learning as stochastic approximation of regularization paths: Optimality and almost-sure convergence. IEEE Trans. Information Theory, 60(9):5716–5735, 2014.
  • [49] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE transactions on automatic control, 31(9):803–812, 1986.
  • [50] John Nikolas Tsitsiklis. Problems in decentralized decision making and computation. Technical report, Massachusetts Inst Of Tech Cambridge Lab For Information And Decision Systems, 1984.
  • [51] Alexandre B Tsybakov. Optimal rates of aggregation. In Learning Theory and Kernel Machines, pages 303–313. Springer, 2003.
  • [52] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.
  • [53] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems and Control Letters, 53(1):65–78, 2004.
  • [54] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
  • [55] Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.
  • [56] Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.
  • [57] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299–3340, 2015.
  • [58] Yuchen Zhang and Lin Xiao. DiSCO: Distributed optimization for self-concordant empirical loss. In International conference on machine learning, pages 362–370, 2015.
  • [59] Yuchen Zhang, Martin J. Wainwright, and John C. Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.

Appendix A Remarks

In this section we present some remarks about our work.

Alternative protocol

The protocol investigated in [32] updates the iterates via

The original motivations for this protocol are that it is fully decentralised, that agents are only required to communicate locally, and that it reduces to a distributed averaging consensus protocol when the gradient is zero. The protocol (3) that we consider preserves these properties while making the analysis easier. For a discussion on the difference between the two protocols we refer to [40].
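To make the distinction concrete, the following is a minimal sketch of two common forms of decentralised gradient update for least squares. The displayed equations are not reproduced in this text, so the mapping is an assumption: we take the protocol of [32] to average neighbours' iterates and then subtract the local gradient ("combine-then-adapt"), and protocol (3) to average the locally updated iterates ("adapt-then-combine"). The function names, the gossip matrix on a 4-cycle and the synthetic data are all hypothetical.

```python
import numpy as np

def local_gradients(W, X, Y):
    """Least-squares gradient of each agent's empirical risk at its own iterate.
    W: (n, d) iterates, X: (n, m, d) inputs, Y: (n, m) responses."""
    residuals = np.einsum("nmd,nd->nm", X, W) - Y
    return np.einsum("nm,nmd->nd", residuals, X) / X.shape[1]

def step_combine_then_adapt(W, P, X, Y, eta):
    # Assumed form of the alternative protocol: average, then local gradient step.
    return P @ W - eta * local_gradients(W, X, Y)

def step_adapt_then_combine(W, P, X, Y, eta):
    # Assumed form of protocol (3): local gradient step, then average.
    return P @ (W - eta * local_gradients(W, X, Y))

rng = np.random.default_rng(0)
n, m, d = 4, 20, 3
X = rng.normal(size=(n, m, d))
Y = rng.normal(size=(n, m))
# Hypothetical gossip matrix: lazy random walk on a 4-cycle (symmetric, doubly stochastic).
I = np.eye(n)
P = 0.5 * I + 0.25 * (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1))
W0 = rng.normal(size=(n, d))

# With zero gradients both updates reduce to the averaging step W -> P W.
X_zero, Y_zero = np.zeros_like(X), np.zeros_like(Y)
cta_zero = step_combine_then_adapt(W0, P, X_zero, Y_zero, eta=0.1)
atc_zero = step_adapt_then_combine(W0, P, X_zero, Y_zero, eta=0.1)

# With non-zero gradients the two updates differ by eta * (I - P) @ gradients.
cta = step_combine_then_adapt(W0, P, X, Y, eta=0.1)
atc = step_adapt_then_combine(W0, P, X, Y, eta=0.1)
```

Both updates use only the iterates of neighbours (the non-zero entries of P), which is the locality property mentioned above.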

Network error

The network error terms (5) and (6) track the error between the distributed protocol and the ideal single-machine protocol. In the case of a complete graph the deviation is zero so the network terms vanish and the convergence rates for Single-Machine Gradient Descent are recovered. Following the literature on decentralised optimisation, we present our final results (cf. Theorem 4) in terms of the spectral gap, so plugging in the spectral gap of a complete graph in the bound in Theorem 4 does not immediately yield the Single-Machine Gradient Descent result.

Parameter tuning

The choice of parameters in Theorem 3 depends on the quantities and that are related to the estimation problem. In practice, these quantities are often unknown. In the single-machine setting, this lack of knowledge is typically addressed via cross-validation [47]. Investigating the design of decentralised cross-validation schemes is outside of the scope of this work and we leave it to future research. However, we highlight that as we consider implicit regularisation strategies and, in particular, early stopping, model complexity can be controlled with iteration time and this yields computational savings for cross-validation compared to methods that require solving independent problem instances for different choices of parameters.

Accelerated gossip

Accelerated gossip schemes can also be considered to yield improved dependence on the network topology, depending on the amount of information agents have access to about the communication matrix . Accelerated gossip can be achieved by replacing the matrix by a polynomial of appropriate order, e.g. , leading to . The weights can be tuned to increase the spectral gap, i.e. . We highlight that the algorithm that we consider only needs to have access to the number of nodes and the second largest eigenvalue in magnitude of the matrix . Within this framework, one can use Chebyshev polynomials to obtain the improved rate , and more information on the spectrum of yields better rates on the transient phase [11, 5].
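As an illustration of how a polynomial of the gossip matrix can enlarge the spectral gap, the sketch below builds a scaled Chebyshev polynomial of a symmetric gossip matrix and compares its second largest eigenvalue in magnitude with that of the same number of plain gossip rounds. The specific polynomial and weights referred to above are not reproduced in this text, so the construction used here (the standard scaling T_k(P / lam2) / T_k(1 / lam2)) is an assumption, as are the function names and the choice of a 20-node cycle.

```python
import numpy as np

def second_largest_abs_eig(P):
    """Second largest eigenvalue in magnitude of a symmetric matrix."""
    return np.sort(np.abs(np.linalg.eigvalsh(P)))[-2]

def chebyshev_gossip_matrix(P, lam2, k):
    """Return T_k(P / lam2) / T_k(1 / lam2) for k >= 1, using the
    three-term recurrence T_{j+1}(x) = 2 x T_j(x) - T_{j-1}(x)."""
    A = P / lam2
    T_prev, T_curr = np.eye(len(P)), A
    s_prev, s_curr = 1.0, 1.0 / lam2  # same recurrence for the scalar normaliser
    for _ in range(k - 1):
        T_prev, T_curr = T_curr, 2 * A @ T_curr - T_prev
        s_prev, s_curr = s_curr, 2 * s_curr / lam2 - s_prev
    return T_curr / s_curr

# Hypothetical example: lazy random walk on a cycle with 20 nodes.
n, k = 20, 5
I = np.eye(n)
P = 0.5 * I + 0.25 * (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1))

lam2 = second_largest_abs_eig(P)
Q = chebyshev_gossip_matrix(P, lam2, k)

lam2_plain = second_largest_abs_eig(np.linalg.matrix_power(P, k))  # k plain rounds
lam2_cheb = second_largest_abs_eig(Q)                               # degree-k polynomial
```

The matrix Q still has rows summing to one, so it preserves averages, but it can have negative entries and is therefore not a transition matrix in general. Both Q and the k-th power of P cost k rounds of communication, and only the number of rounds and lam2 are needed to form Q.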

Additional requirements in Theorem 4

Theorem 4 includes two additional requirements over single-machine gradient descent, whose origins we briefly explain. The requirement is purely cosmetic and serves to yield a cleaner bound. For more details, see the proof of Lemma C.3.2 in Section C.3.2. The requirement , on the other hand, often arises when analysing Distributed Gradient Descent, see [17] for instance. In particular, it ensures sufficient iterations have been performed to reach the mixing time of the Markov chain associated to . See Section C.3.1.

Appendix B Proof scheme

In this section we illustrate the main scheme for the proof of Theorem 4, from which Theorem 3 follows. Section B.1 presents the error decomposition into bias, variance, and network terms. Section B.2 presents the sketch of the statistical analysis for these terms, which is given in full in Section C.

B.1 Error decomposition

The error decomposition is based on the introduction of two auxiliary processes used to compare the iterates of Distributed Gradient Descent (3).

The first auxiliary process represents the iterates generated if agents were to know the marginal distribution . Initialised at , the process is defined as follows for :

This device has already been used in the analysis of non-parametric regression in the single-machine setting [26].

The second auxiliary process represents the iterates generated if agents were to be part of a complete graph topology and were to use the protocol given by . Initialised at for all , the process is defined as follows for :

The analysis of iterative decentralised algorithms typically builds upon the introduction of a device analogous to this one [32, 17]. Initialised at , Single-Machine Gradient Descent is defined as follows for :

It is easy to see that we have for and . This allows us to produce an analysis of Distributed Gradient Descent that relies upon known results for Single-Machine Gradient Descent.
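The identity between the complete-graph process and Single-Machine Gradient Descent can be checked numerically. The sketch below is an illustration under stated assumptions: a finite-dimensional linear model, an adapt-then-combine form of the update (the displayed equations are not reproduced in this text), a uniform averaging matrix for the complete graph, and a common zero initialisation. All variable names and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d, eta, T = 5, 10, 3, 0.1, 50
X = rng.normal(size=(n, m, d))   # agent v holds the m samples X[v], Y[v]
Y = rng.normal(size=(n, m))

# Complete graph: every agent averages uniformly over all n agents.
P_complete = np.full((n, n), 1.0 / n)

W = np.zeros((n, d))             # one iterate per agent
w = np.zeros(d)                  # single-machine iterate using all n * m samples
X_all, Y_all = X.reshape(n * m, d), Y.reshape(n * m)

for _ in range(T):
    # Local least-squares gradients, one per agent.
    residuals = np.einsum("nmd,nd->nm", X, W) - Y
    grads = np.einsum("nm,nmd->nd", residuals, X) / m
    # Assumed decentralised update: local gradient step, then averaging.
    W = P_complete @ (W - eta * grads)
    # Gradient descent on the pooled empirical risk.
    w = w - eta * X_all.T @ (X_all @ w - Y_all) / (n * m)

# Largest coordinate-wise gap between any agent's iterate and the pooled iterate.
max_gap = np.abs(W - w).max()
```

Averaging the n local gradients, each computed on m samples, gives the gradient of the pooled empirical risk over all n * m samples, which is why every agent's iterate coincides with the single-machine iterate up to floating-point error.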

Let us introduce the linear map defined by . The following error decomposition holds. For any and we have

From the work in [39], for any . Adding and subtracting and using we get

Following the same steps, adding and subtracting , we find

where we used the equality of and . Proposition B.1 decomposes the error into three terms. The first term is deterministic and corresponds to the square of the Bias in the single-machine setting [54]. The second term aligns with what is called the Sample Variance in the single-machine setting, and in this case matches the sample variance obtained for Single-Machine Gradient Descent run on all observations. The third term accounts for the error due to performing a decentralised protocol and we call it the Network Error.

B.2 Statistical analysis of error terms

In this section we illustrate the main ideas of the statistical analysis used to control the error terms in Proposition B.1. Full details are given in Section C.

Notation

Let and be positive natural numbers with . For any operator , define , with the convention , where is the identity operator on . Let denote a sequence of nodes in . For a family of operators indexed by the nodes on the graph , define and , with . Let be the probability of the path generated by a Markov chain with transition kernel . For each agent , let with be the empirical covariance operator associated to the agent’s own data , and let . For , let be a random variable that only depends on the randomness in and that has zero mean, . The random variable , formally defined in (8) in Section C.3, captures the sampling error introduced at iteration of gradient descent by agent . For the discussion below it suffices to mention the two properties above.
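The path probabilities in this notation can be illustrated with a small sketch. It assumes that the probability of a path is the product of the one-step transition probabilities along it, conditional on the starting node, and checks that summing over all paths of a given length between two nodes recovers the corresponding entry of a power of the transition matrix. The helper name, the 4-node cycle and the chosen endpoints are hypothetical.

```python
import itertools
import numpy as np

def path_probability(P, path):
    """Probability that a Markov chain with kernel P, started at path[0],
    visits the nodes of `path` in order."""
    return float(np.prod([P[a, b] for a, b in zip(path[:-1], path[1:])]))

# Hypothetical gossip matrix: lazy random walk on a 4-cycle.
n, t = 4, 3
I = np.eye(n)
P = 0.5 * I + 0.25 * (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1))

# Sum the probabilities of all paths of t steps from node v to node w.
v, w = 0, 2
total = sum(
    path_probability(P, (v, *middle, w))
    for middle in itertools.product(range(n), repeat=t - 1)
)

# The same quantity read off the t-th power of the transition matrix.
matrix_power_entry = np.linalg.matrix_power(P, t)[v, w]
```

This is the sense in which sums over paths in the error terms encode the mixing properties of the gossip matrix.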

The following paragraphs discuss the analysis for each of the error terms.

Bias

The analysis follows the single-machine setting and is given in Proposition C.1 in Section C.1.

Sample Variance

The analysis follows the single-machine setting [26], although the original result yields a high probability bound with a requirement on the number of samples . We therefore follow the result in [25] which yields a bound in high probability without a condition on the sample size. The bound for this term is presented in Theorem C.2 in Section C.2.

Network Error

Unravelling the iterates (Lemma C.3 in Section C.3) we get, for any :

This characterisation makes explicit the dependence of the network error on both the communication protocol used by the agents, via the dependence on the mixing properties of the gossip matrix along each path , and on the statistical properties of the problem, via the product of empirical covariance operators held by the agents along each path . As the randomness in the quantities might depend on the randomness in the empirical covariance operators, we further decompose the network error into two terms so that we can use the property . By adding and subtracting the terms inside the sums we have

From a statistical point of view, the Population Covariance Error term only depends on the population covariance via the quantities , and the only source of randomness is given by . Using concentration for , the square of this error term can be bounded by a quantity that decreases as , as announced in Section 4 alongside the discussion of Theorem 4. On the other hand, the Residual Empirical Covariance Error term depends on deviations between the empirical covariance and the population covariance via the quantities . Exploiting the additional concentration of these factors allows us to bound the square of this error term by a higher-order quantity that decreases as .

We now present a separate discussion on the analysis for these two error terms, emphasising the interplay between network topology (mixing of random walks on graphs) and statistics (concentration). The final bound for the network error is presented in Theorem C.3.3 in Section C.3.

Population Covariance Error

Expanding the square yields a summation over all pairs of paths:

for properly defined quantities (the dependence on is neglected). When taking the expectation, as the random variables have zero mean and are independent across agents , the only paths left are those that intersect at the final node, i.e.  such that