# IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method

We introduce a framework for designing primal methods under the decentralized optimization setting where local functions are smooth and strongly convex. Our approach consists of approximately solving a sequence of sub-problems induced by the accelerated augmented Lagrangian method, thereby providing a systematic way for deriving several well-known decentralized algorithms including EXTRA arXiv:1404.6264 and SSDA arXiv:1702.08704. When coupled with accelerated gradient descent, our framework yields a novel primal algorithm whose convergence rate is optimal and matched by recently derived lower bounds. We provide experimental results that demonstrate the effectiveness of the proposed algorithm on highly ill-conditioned problems.

## 1 Introduction

Due to their rapidly increasing size, modern datasets are typically collected, stored and manipulated in a distributed manner. This, together with strict privacy requirements, has created a large demand for efficient solvers for the decentralized setting in which models are trained locally at each agent, and only local parameter vectors are shared. This approach has become particularly appealing for applications such as edge computing Shi et al. (2016); Mao et al. (2017), cooperative multi-agent learning Bernstein et al. (2002); Panait and Luke (2005) and federated learning McMahan et al. (2017); Shokri and Shmatikov (2015). Clearly, the nature of the decentralized setting prevents a global synchronization, as only communication within the neighboring machines is allowed. The goal is then to arrive at a consensus on all local agents with a model that performs as well as in the centralized setting.

Arguably, the simplest approach for addressing decentralized settings is to adapt the vanilla gradient descent method to the underlying network architecture Xiao et al. (2007); Nedic and Ozdaglar (2009); Duchi et al. (2011); Jakovetić et al. (2014b). To this end, the connections between the agents are modeled through a mixing matrix, which dictates how agents average over their neighbors’ parameter vectors. Thus, the mixing matrix serves as a communication oracle which determines how information propagates throughout the network. Perhaps surprisingly, when the stepsizes are constant, simply averaging over the local iterates via the mixing matrix only converges to a neighborhood of the optimum Yuan et al. (2016); Shi et al. (2015). A recent line of works Shi et al. (2014); Jakovetić et al. (2014a); Shi et al. (2015); Qu and Li (2017); Nedic et al. (2017); Nedić et al. (2017) proposed a number of alternative methods that linearly converge to the global minimum.

The overall complexity of solving decentralized optimization problems is typically determined by two factors: (i) the condition number of the objective function $\kappa_f$, which measures the ‘hardness’ of solving the underlying optimization problem, and (ii) the condition number of the mixing matrix $\kappa_W$, which quantifies the severity of information ‘bottlenecks’ present in the network. Lower complexity bounds recently derived for distributed settings Arjevani and Shamir (2015); Scaman et al. (2017); Woodworth et al. (2018); Arjevani et al. (2020) show that one cannot expect to have a better dependence on the condition numbers than $\sqrt{\kappa_f}$ and $\sqrt{\kappa_W}$. Notably, despite the considerable recent progress, none of the methods mentioned above is able to achieve accelerated rates, that is, a square root dependence on both $\kappa_f$ and $\kappa_W$ simultaneously.

An extensive effort has been devoted to obtaining acceleration for decentralized algorithms under various settings Scaman et al. (2017, 2018); Li et al. (2018); Xu et al. (2020); Zhang et al. (2019); Uribe et al. (2020); Hendrikx et al. (2020); Dvinskikh and Gasnikov (2019); Fallah et al. (2019). When a dual oracle is available, that is, access to the gradients of the dual functions is provided, optimal rates can be attained for smooth and strongly convex objectives Scaman et al. (2017). However, having access to a dual oracle is a very restrictive assumption, and resorting to a direct ‘primalization’ through inexact approximation of the dual gradients leads to sub-optimal worst-case theoretical rates Uribe et al. (2020). In this work, we propose a novel primal approach that leads to optimal rates in terms of dependency on $\kappa_f$ and $\kappa_W$.

Our contributions can be summarized as follows.

• We introduce a novel framework based on the accelerated augmented Lagrangian method for designing primal decentralized methods. The framework provides a simple and systematic way for deriving several well-known decentralized algorithms Shi et al. (2014); Jakovetić et al. (2014a); Shi et al. (2015), including EXTRA Shi et al. (2015) and SSDA Scaman et al. (2017), and unifies their convergence analyses.

• Using accelerated gradient descent as a sub-routine, we derive a novel method for smooth and strongly convex local functions which achieves optimal accelerated rates on both condition numbers of the problem, $\kappa_f$ and $\kappa_W$, using primal updates, see Table 2.

• We perform a large number of experiments, which confirm our theoretical findings, and demonstrate a significant improvement when the objective function is ill-conditioned and .

## 2 Decentralized Optimization Setting

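We consider $n$ computational agents and a network graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ which defines how the agents are linked. The set of vertices $\mathcal{V} = \{1, \dots, n\}$ represents the agents and the set of edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ specifies the connectivity in the network, i.e., a communication link between agents $i$ and $j$ exists if and only if $(i, j) \in \mathcal{E}$. Each agent $i$ has access to local information encoded by a loss function $f_i : \mathbb{R}^d \to \mathbb{R}$. The goal is to minimize the global objective over the entire network,

$$\min_{x \in \mathbb{R}^d} f(x) := \sum_{i=1}^{n} f_i(x). \tag{1}$$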

In this paper, we assume that the local loss functions $f_i$ are differentiable, $L$-smooth and $\mu$-strongly convex. (Here $f$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz, and $f$ is $\mu$-strongly convex if $f - \frac{\mu}{2}\|\cdot\|^2$ is convex.) Strong convexity of the component functions implies that the problem admits a unique solution, which we denote by $x^*$.

We consider the following computation and communication models Scaman et al. (2017):

• Local computation: Each agent $i$ is able to compute the gradients of $f_i$, and the cost of this computation is one unit of time.

• Communication: Communication is done synchronously, and each agent can only exchange information with its neighbors, where $j$ is a neighbor of $i$ if $(i, j) \in \mathcal{E}$. The ratio between the communication cost and computation cost per round is denoted by $\tau$.

We further assume that propagation of information is governed by a mixing matrix $W \in \mathbb{R}^{n \times n}$ (Nedic and Ozdaglar, 2009; Yuan et al., 2016; Scaman et al., 2017). Specifically, given a local copy of the decision variable $x_i \in \mathbb{R}^d$ at node $i$, one round of communication provides the following update: $x_i \leftarrow \sum_{j=1}^{n} W_{ij} x_j$. The following standard assumptions regarding the mixing matrix Scaman et al. (2017) are made throughout the paper.

###### Assumption 1.

The mixing matrix $W$ satisfies the following:

1. Symmetry: $W = W^T$.

2. Positiveness: $W$ is positive semi-definite.

3. Decentralized property: If $(i, j) \notin \mathcal{E}$ and $i \neq j$, then $W_{ij} = 0$.

4. Spectrum property: The kernel of $W$ is given by the vector of all ones, $\mathrm{Ker}(W) = \mathbb{R}\mathbf{1}_n$.

A typical choice of the mixing matrix is the (weighted) Laplacian matrix of the graph. Another common choice is to set $W$ as $I - \tilde{W}$, where $\tilde{W}$ is a doubly stochastic matrix Aybat and Gürbüzbalaban (2017); Can et al. (2019); Shi et al. (2015). By Assumption 1.4, all the eigenvalues of $W$ are strictly positive, except for the smallest one. We let $\lambda_{\max}(W)$ denote the maximum eigenvalue, and let $\lambda^+_{\min}(W)$ denote the smallest positive eigenvalue. The ratio between these two quantities plays an important role in quantifying the overall complexity of this problem.
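Assumption 1 can be checked numerically for a concrete graph. The following is a minimal sketch, assuming a cycle graph on six nodes with its unweighted Laplacian as the mixing matrix; the helper names `cycle_laplacian` and `check_assumption_1` are illustrative and do not come from the paper.

```python
import numpy as np

def cycle_laplacian(n):
    """Unweighted graph Laplacian of an undirected cycle on n >= 3 nodes."""
    W = 2.0 * np.eye(n)
    for i in range(n):
        W[i, (i + 1) % n] = -1.0
        W[i, (i - 1) % n] = -1.0
    return W

def check_assumption_1(W, edges, tol=1e-9):
    """Return the four properties of Assumption 1 as booleans."""
    n = W.shape[0]
    eigvals = np.linalg.eigvalsh(W)          # ascending order
    symmetric = bool(np.allclose(W, W.T))
    psd = bool(eigvals[0] > -tol)
    # Off-diagonal entries must vanish for pairs that are not neighbours.
    decentralized = all(
        abs(W[i, j]) < tol
        for i in range(n) for j in range(n)
        if i != j and (i, j) not in edges and (j, i) not in edges
    )
    # Kernel equals span{1}: W @ 1 = 0 and exactly one zero eigenvalue.
    kernel_is_ones = bool(np.allclose(W @ np.ones(n), 0.0) and eigvals[1] > tol)
    return symmetric, psd, decentralized, kernel_is_ones

n = 6
edges = {(i, (i + 1) % n) for i in range(n)}
W = cycle_laplacian(n)
props = check_assumption_1(W, edges)

eigvals = np.linalg.eigvalsh(W)
lam_max, lam_min_pos = eigvals[-1], eigvals[1]
ratio = lam_max / lam_min_pos                # ratio of extreme positive eigenvalues
print(props, ratio)
```

For this graph the Laplacian eigenvalues are 0, 1, 1, 3, 3, 4, so the ratio of the largest to the smallest positive eigenvalue is 4.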

###### Theorem 1 (Decentralized lower bound Scaman et al. (2017)).

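For any first-order black-box decentralized method, the number of time units required to reach an $\epsilon$-optimal solution for (1) is lower bounded by

$$\Omega\left(\sqrt{\kappa_f}\left(1 + \tau\sqrt{\kappa_W}\right)\log\left(\frac{1}{\epsilon}\right)\right), \tag{2}$$

where $\kappa_f = L/\mu$ is the condition number of the loss function and $\kappa_W = \lambda_{\max}(W)/\lambda^+_{\min}(W)$ is the condition number of the mixing matrix.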

The lower bound decomposes as follows: a) computation cost, given by $\sqrt{\kappa_f}\log(1/\epsilon)$, and b) communication cost, given by $\tau\sqrt{\kappa_f\kappa_W}\log(1/\epsilon)$. The computation cost matches lower bounds for centralized settings Nesterov (2004); Arjevani and Shamir (2016), while the communication cost introduces an additional term which depends on $\kappa_W$ and accounts for the ‘price’ of communication in decentralized models. It follows that the effective condition number of a given decentralized problem is $\kappa_f\kappa_W$.
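The decomposition is obtained by expanding the product in (2). As a worked illustration, with hypothetical values $\kappa_f = 10^4$, $\kappa_W = 10^2$ and $\tau = 1$ (chosen only for this example):

$$\sqrt{\kappa_f}\left(1 + \tau\sqrt{\kappa_W}\right)\log\left(\frac{1}{\epsilon}\right) = \underbrace{\sqrt{\kappa_f}\,\log\left(\frac{1}{\epsilon}\right)}_{\text{computation}} + \underbrace{\tau\sqrt{\kappa_f\kappa_W}\,\log\left(\frac{1}{\epsilon}\right)}_{\text{communication}} = \left(100 + 1000\right)\log\left(\frac{1}{\epsilon}\right).$$

In this example the communication term is ten times larger than the computation term, and the two terms balance when $\tau\sqrt{\kappa_W} = 1$.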

Clearly, the choice of the matrix $W$ can strongly affect the optimal attainable performance. For example, $\kappa_W$ can get as large as $n^2$ in the line/cycle graph, or be constant in the complete graph. In this paper, we do not focus on optimizing over the choice of $W$ for a given graph $\mathcal{G}$; instead, following the approach taken by existing decentralized algorithms, we assume that the graph $\mathcal{G}$ and the mixing matrix $W$ are given and aim to achieve the optimal complexity for this particular choice of $W$.
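The dependence on the topology can be illustrated numerically. The sketch below assumes unweighted Laplacians as mixing matrices (one of the choices mentioned above) and compares the ratio of the largest to the smallest positive eigenvalue on cycle and complete graphs of growing size; all function names are illustrative.

```python
import numpy as np

def condition_number(W, tol=1e-9):
    """lambda_max(W) / lambda_min^+(W) for a symmetric PSD matrix W."""
    eigvals = np.linalg.eigvalsh(W)
    positive = eigvals[eigvals > tol]
    return positive[-1] / positive[0]

def cycle_laplacian(n):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def complete_laplacian(n):
    return n * np.eye(n) - np.ones((n, n))

sizes = [10, 20, 40]
kappa_cycle = [condition_number(cycle_laplacian(n)) for n in sizes]
kappa_complete = [condition_number(complete_laplacian(n)) for n in sizes]

for n, kc, kk in zip(sizes, kappa_cycle, kappa_complete):
    print(f"n={n:3d}  cycle: {kc:8.2f}  complete: {kk:.2f}")
```

Doubling $n$ roughly quadruples the value on the cycle, consistent with quadratic growth in $n$, while the complete graph stays at 1.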

## 3 Related Work and the Dual Formulation

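A standard approach to address problem (1) is to express it as a constrained optimization problem

$$\min_{X \in \mathbb{R}^{nd}} F(X) := \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) \quad \text{such that} \quad x_1 = x_2 = \cdots = x_n \in \mathbb{R}^d, \tag{P}$$

where $X = [x_1; x_2; \cdots; x_n] \in \mathbb{R}^{nd}$ is a concatenation of the vectors. To lighten the notation, we introduce the global mixing matrix $\mathbf{W} = W \otimes I_d$, where $\otimes$ denotes the Kronecker product, and let $\|\cdot\|_{\mathbf{W}}$ denote the semi-norm induced by $\mathbf{W}$, i.e. $\|X\|^2_{\mathbf{W}} = X^T \mathbf{W} X$. With this notation in hand, we briefly review existing literature on decentralized algorithms.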

The decentralized gradient method Nedic and Ozdaglar (2009); Yuan et al. (2016) has the update rule

$$X_{k+1} = \mathbf{W} X_k - \eta \nabla F(X_k). \tag{DGD}$$

However, with constant stepsize, the algorithm does not converge to a global minimum of (P), but rather to a neighborhood of the solution (Yuan et al., 2016). A decreasing stepsize schedule may be used to ensure convergence, but this yields a sublinear convergence rate, even in the strongly convex case.
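The constant-stepsize behaviour can be reproduced on a toy problem. The sketch below makes several assumptions that are not in the paper: scalar quadratic local losses $f_i(x) = \frac{1}{2}(x - a_i)^2$, a cycle graph, and a doubly stochastic averaging matrix `W_avg` built as identity minus a scaled Laplacian. Gradients are taken per agent, so any $1/n$ scaling is absorbed into the stepsize.

```python
import numpy as np

def dgd(W_avg, grad, X0, eta, iters):
    """Decentralized gradient descent with a constant stepsize eta."""
    X = X0.copy()
    for _ in range(iters):
        X = W_avg @ X - eta * grad(X)
    return X

n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
laplacian = np.diag(A.sum(axis=1)) - A
W_avg = np.eye(n) - laplacian / 3.0      # symmetric and doubly stochastic (assumed choice)

a = np.arange(n, dtype=float)            # minimizers of the local losses
grad = lambda X: X - a                   # stacked gradients of f_i(x) = 0.5 * (x - a_i)^2
x_star = a.mean()                        # minimizer of sum_i f_i

errors = {}
for eta in (0.1, 0.01):
    X = dgd(W_avg, grad, np.zeros(n), eta, iters=5000)
    errors[eta] = np.linalg.norm(X - x_star)
    print(f"eta={eta}: distance to consensus optimum = {errors[eta]:.4f}")
```

The iterates settle at a fixed point whose distance to the optimum is nonzero and shrinks with the stepsize, which is the neighborhood behaviour described above.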

### Linearly convergent primal algorithms

By and large, recent methods that achieve linear convergence in the strongly convex case Shi et al. (2014); Jakovetić et al. (2014a); Shi et al. (2015); Qu and Li (2017); Nedic et al. (2017); Nedić et al. (2017); Sun et al. (2019) can be shown to follow a general framework based on the augmented Lagrangian method, see Algorithm 1. The main difference lies in how subproblems are solved. Shi et al. (2014) apply an alternating directions method; in Shi et al. (2015), the EXTRA algorithm takes a single gradient descent step to solve each subproblem, see Appendix B for details. Jakovetić et al. (2014a) use multi-step algorithms such as Jacobi/Gauss-Seidel methods. To the best of our knowledge, these algorithms are non-accelerated, that is, none of them attains a square root dependence on both $\kappa_f$ and $\kappa_W$. The recently proposed algorithm APM-C Li et al. (2018) enjoys a square root dependence on $\kappa_f$ and $\kappa_W$, but incurs an additional factor compared to the optimal attainable rate.

### Optimal method based on the dual formulation

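By Assumption 1.4, the constraint $x_1 = x_2 = \cdots = x_n$ is equivalent to the identity $\mathbf{W} X = 0$, which is again equivalent to $\sqrt{\mathbf{W}} X = 0$. Hence, the dual formulation of (P) is given by

$$\max_{\Lambda \in \mathbb{R}^{dn}} -F^*\left(-\sqrt{\mathbf{W}}\,\Lambda\right). \tag{D}$$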

Since the primal function is convex and the constraints are linear, we can use strong duality and address the dual problem instead of the primal one. Using this approach, Scaman et al. (2017) proposed a dual method with optimal accelerated rates, using Nesterov’s accelerated gradient method for the dual problem (D). As mentioned earlier, the main drawback of this method is that it requires access to the gradient of the dual function which, unless the primal function has a relatively simple structure, is not available. One may apply a first-order method to approximate the dual gradients inexactly at the expense of an additional factor in the computation cost Uribe et al. (2020), but this would make the algorithm no longer optimal. This indicates that achieving optimal rates when using primal updates is a rather challenging task in the decentralized setting. In the following sections, we provide a generic framework which allows us to derive a primal decentralized method with optimal complexity guarantees.

## 4 An Inexact Accelerated Augmented Lagrangian framework

In this section, we introduce our inexact accelerated Augmented Lagrangian framework, and show how to combine it with Nesterov’s acceleration. To ease the presentation, we first describe a conceptual algorithm, Algorithm 2, where subproblems are solved exactly, and only then introduce inexact inner-solvers.

Similarly to Nesterov’s accelerated gradient method, we use an extrapolation step for the dual variable. The component in line 4 of Algorithm 2 is the negative gradient of the Moreau envelope of the dual function. (A proper definition of the Moreau envelope is given in Rockafellar and Wets (2009); readers who are not familiar with this concept may regard it as an implicit function which shares the same optimum as the original function.) Hence our algorithm is equivalent to applying Nesterov’s method to the Moreau envelope of the dual function, or equivalently, to an accelerated dual proximal point algorithm. This renders the optimal dual method proposed in Scaman et al. (2017) a special case of our algorithmic framework (with $\rho$ set to 0).
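The listing of Algorithm 2 is not reproduced in this text, so the following is only a minimal sketch of an accelerated augmented Lagrangian iteration consistent with the description above: an exact subproblem solve, a dual step with stepsize $\eta$, and an extrapolation with momentum $\beta$, using the parameter formulas that appear in Theorem 2 and (6). It assumes $d = 1$, quadratic local losses $f_i(x) = \frac{1}{2}c_i(x - a_i)^2$ so that the subproblem is a linear system, and an objective taken as the plain sum of the $f_i$; all names are illustrative.

```python
import numpy as np

def accelerated_augmented_lagrangian(C, a, W, rho, iters):
    """Accelerated dual ascent on the augmented Lagrangian of
    min 0.5 * (X - a)^T C (X - a)  subject to  W X = 0,
    with every subproblem solved exactly (conceptual variant)."""
    eig = np.linalg.eigvalsh(W)
    lam_max, lam_min_pos = eig[-1], eig[eig > 1e-9][0]
    mu, L = np.min(np.diag(C)), np.max(np.diag(C))
    L_rho = lam_max / (mu + rho * lam_max)            # dual smoothness
    mu_rho = lam_min_pos / (L + rho * lam_min_pos)    # dual strong concavity
    eta = 1.0 / L_rho
    beta = (np.sqrt(L_rho) - np.sqrt(mu_rho)) / (np.sqrt(L_rho) + np.sqrt(mu_rho))

    n = len(a)
    Lam = np.zeros(n)        # dual iterate
    Omega = np.zeros(n)      # extrapolated dual iterate
    H = C + rho * W          # Hessian of the subproblem
    X = np.zeros(n)
    for _ in range(iters):
        # Exact minimizer of F(X) + Omega^T X + (rho / 2) * X^T W X.
        X = np.linalg.solve(H, C @ a - Omega)
        Lam_next = Omega + eta * (W @ X)              # dual ascent step
        Omega = Lam_next + beta * (Lam_next - Lam)    # extrapolation
        Lam = Lam_next
    return X

n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
W = np.diag(A.sum(axis=1)) - A                        # cycle-graph Laplacian
c = np.linspace(1.0, 10.0, n)                         # local curvatures
a = np.arange(n, dtype=float)                         # local minimizers
X = accelerated_augmented_lagrangian(np.diag(c), a, W, rho=1.0, iters=200)
x_star = (c @ a) / c.sum()                            # consensus minimizer
print(np.max(np.abs(X - x_star)))
```

Setting `rho=0.0` in this sketch removes the augmentation term and leaves accelerated gradient ascent on the plain dual, which mirrors the special case mentioned above.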

While Algorithm 2 is conceptually plausible, it requires an exact solution of the Augmented Lagrangian problems, which can be too expensive in practice. To address this issue, we introduce an inexact version, shown in Algorithm 3, where the $k$-th subproblem is solved up to a predefined accuracy $\epsilon_k$. The choice of $\epsilon_k$ is rather subtle. On the one hand, choosing a large $\epsilon_k$ may result in a non-converging algorithm. On the other hand, choosing a small $\epsilon_k$ can be exceedingly expensive, as the optimal solution of the subproblem $X_k^*$ is not the global optimum $X^*$. Intuitively, $\epsilon_k$ should decay at the same rate as the distance to the global optimum (compare (3) and (4)), leading to the following result.

###### Theorem 2.

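Consider the sequence of primal variables $(X_k)$ generated by Algorithm 3 with the subproblem solved up to accuracy $\epsilon_k$ in Option I. With parameters set to

$$\beta_k = \frac{\sqrt{L_\rho} - \sqrt{\mu_\rho}}{\sqrt{L_\rho} + \sqrt{\mu_\rho}}, \qquad \eta = \frac{1}{L_\rho}, \qquad \epsilon_k = \frac{\mu_\rho}{2\lambda_{\max}(W)}\left(1 - \frac{1}{2}\sqrt{\frac{\mu_\rho}{L_\rho}}\right)^{k}\Delta_{\mathrm{dual}}, \tag{3}$$

where $L_\rho = \frac{\lambda_{\max}(W)}{\mu + \rho\lambda_{\max}(W)}$, $\mu_\rho = \frac{\lambda^+_{\min}(W)}{L + \rho\lambda^+_{\min}(W)}$, and $\Delta_{\mathrm{dual}}$ is the initial dual function gap, we obtain

$$\|X_k - X^*\|^2 \le C_\rho\left(1 - \frac{1}{2}\sqrt{\frac{\mu_\rho}{L_\rho}}\right)^{k}\Delta_{\mathrm{dual}}, \tag{4}$$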

where and .

###### Corollary 3.

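The number of subproblems to achieve $\|X_k - X^*\|^2 \le \epsilon$ in IDEAL is bounded by

$$K = O\left(\sqrt{\frac{L_\rho}{\mu_\rho}}\,\log\left(\frac{C_\rho\Delta_{\mathrm{dual}}}{\epsilon}\right)\right). \tag{5}$$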

We remark that inexact accelerated Augmented Lagrangian methods have been previously analyzed under different assumptions Nedelcu et al. (2014); Kang et al. (2015); Yan and He (2020). The main difference is that here, we are able to establish a linear convergence rate, whereas existing analyses only yield sublinear rates. One of the reasons for this discrepancy is that, although $F$ is strongly convex, the dual problem (D) is not, as the mixing matrix $\mathbf{W}$ is singular. The key to obtaining a linear convergence rate is a fine-grained analysis of the dual problem, showing that the dual variables always lie in the subspace where strong convexity holds. The proof of the theorem relies on the equivalence between Augmented Lagrangian methods and the dual proximal point algorithm Rockafellar (1976); Bertsekas (2014), which can be interpreted as applying an inexact accelerated proximal point algorithm Güler (1992); Lin et al. (2017) to the dual problem. A complete convergence analysis is deferred to Section C in the appendix.

Theorem 2 provides an accelerated convergence rate with respect to the ‘augmented’ condition number $\kappa_\rho = L_\rho/\mu_\rho$, as determined by the Augmented Lagrangian parameter $\rho$ in Algorithm 3. We have the following bounds:

$$\underbrace{1}_{\rho = \infty} \le \kappa_\rho = \frac{L + \rho\lambda^+_{\min}(W)}{\mu + \rho\lambda_{\max}(W)} \cdot \frac{\lambda_{\max}(W)}{\lambda^+_{\min}(W)} \le \underbrace{\frac{L}{\mu} \cdot \frac{\lambda_{\max}(W)}{\lambda^+_{\min}(W)}}_{\rho = 0} = \kappa_f\kappa_W, \tag{6}$$

where we observe that the condition number $\kappa_\rho$ is a decreasing function of the regularization parameter $\rho$. When $\rho = 0$, the maximum value is attained at $\kappa_f\kappa_W$, the effective condition number of the decentralized problem. As $\rho$ goes to infinity, the augmented condition number $\kappa_\rho$ goes to 1. Naively, one may want to take $\rho$ as large as possible to get fast convergence. However, one must also take into account the complexity of solving the subproblems. Indeed, since $\mathbf{W}$ is singular, the additional regularization term in the subproblem does not improve its strong convexity, yielding an increase in inner loop complexity as $\rho$ grows. Hence, the optimal choice of $\rho$ requires balancing the inner and outer complexity in a careful manner.
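The behaviour of the augmented condition number in (6) can be tabulated directly. The sketch below evaluates the middle expression of (6) for a few values of the regularization parameter; the numerical values of $L$, $\mu$ and the two eigenvalues are hypothetical and chosen only for illustration.

```python
def augmented_condition_number(rho, L, mu, lam_max, lam_min_pos):
    """Middle expression of (6) as a function of the regularization rho."""
    return (L + rho * lam_min_pos) / (mu + rho * lam_max) * lam_max / lam_min_pos

# Hypothetical problem constants.
L, mu = 10.0, 1.0
lam_max, lam_min_pos = 4.0, 0.5
kappa_f, kappa_W = L / mu, lam_max / lam_min_pos

rhos = [0.0, 0.1, 1.0, 10.0, 1e6]
kappas = [augmented_condition_number(r, L, mu, lam_max, lam_min_pos) for r in rhos]
for r, k in zip(rhos, kappas):
    print(f"rho={r:>9}: kappa_rho={k:.4f}")
```

With these constants the value starts at $\kappa_f\kappa_W = 80$ for zero regularization, decreases monotonically, and approaches 1 as the regularization grows.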

To study the inner loop complexity, we introduce a warm-start strategy. Intuitively, the distance between $X_{k-1}$ and the $k$-th solution to the subproblem $X_k^*$ is roughly on the order of $\epsilon_{k-1}$. More precisely, we have the following result.

###### Lemma 4.

Given the parameter choice in Theorem 2, initializing the subproblem at $X_{k-1}$ yields

$$\|X_{k-1} - X_k^*\|^2 \le \frac{8C_\rho}{\mu_\rho}\,\epsilon_{k-1}.$$

Consequently, the ratio between the initial gap at the $k$-th subproblem and the desired gap $\epsilon_k$ is bounded by

$$\frac{\|X_{k-1} - X_k^*\|^2}{\epsilon_k} \le \frac{8C_\rho}{\mu_\rho} \cdot \frac{\epsilon_{k-1}}{\epsilon_k} \le \frac{16C_\rho}{\mu_\rho} = O(\kappa_f\kappa_W\rho^2),$$

which is independent of $k$. In other words, the inner loop solver only needs to decrease the iterate gap by a constant factor for each $k$. If the algorithm enjoys a linear convergence rate, a constant number of iterations is sufficient for that. If the algorithm enjoys a sublinear convergence rate, then the inner loop complexity grows with $k$. To illustrate the behaviour of different algorithms, we present the inner loop complexity in Table 1. Note that while the inner complexities of GD and AGD are independent of $k$, the inner complexity of SGD increases geometrically with $k$. Other possible choices for inner solvers are the alternating directions or Jacobi/Gauss-Seidel method, both of which yield accelerated variants for Shi et al. (2014) and Jakovetić et al. (2014a).
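The factor of 2 in the bound following Lemma 4 can be read off from the accuracy schedule in (3). Since consecutive accuracies differ by one power of the contraction factor, and $0 < \mu_\rho/L_\rho = 1/\kappa_\rho \le 1$ by (6),

$$\frac{\epsilon_{k-1}}{\epsilon_k} = \left(1 - \frac{1}{2}\sqrt{\frac{\mu_\rho}{L_\rho}}\right)^{-1} \le \left(1 - \frac{1}{2}\right)^{-1} = 2, \qquad \text{hence} \qquad \frac{8C_\rho}{\mu_\rho} \cdot \frac{\epsilon_{k-1}}{\epsilon_k} \le \frac{16C_\rho}{\mu_\rho}.$$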

In fact, the theoretical upper bounds on the inner complexity also provide a more practical way to halt the inner optimization processes (see Option II in Algorithm 3). Indeed, one can predefine the computational budget for each subproblem, for instance, a fixed number of iterations of AGD. If this budget exceeds the theoretical inner complexity in Table 1, then the desired accuracy $\epsilon_k$ is guaranteed to be reached. In particular, we do not need to evaluate the sub-optimality condition, since it is automatically satisfied as long as the budget is chosen appropriately.
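A fixed-budget inner loop of this kind can be sketched as follows. This is not the paper's Algorithm 3 listing, which is not reproduced in this text; it is an illustration under stated assumptions: $d = 1$, quadratic local losses $f_i(x) = \frac{1}{2}c_i(x - a_i)^2$, an objective taken as the plain sum of the $f_i$, constant-momentum AGD as the inner solver, a warm start at the previous primal iterate, and a budget of 100 inner iterations picked by hand rather than from Table 1.

```python
import numpy as np

def agd(grad, x0, L_sub, mu_sub, budget):
    """Nesterov's method with constant momentum, run for a fixed budget."""
    q = np.sqrt(mu_sub / L_sub)
    beta = (1.0 - q) / (1.0 + q)
    x, y = x0.copy(), x0.copy()
    for _ in range(budget):
        x_next = y - grad(y) / L_sub
        y = x_next + beta * (x_next - x)
        x = x_next
    return x

def inexact_accelerated_al(C, a, W, rho, outer_iters, budget):
    """Accelerated augmented Lagrangian with subproblems solved by AGD."""
    eig = np.linalg.eigvalsh(W)
    lam_max, lam_min_pos = eig[-1], eig[eig > 1e-9][0]
    mu, L = np.min(np.diag(C)), np.max(np.diag(C))
    L_rho = lam_max / (mu + rho * lam_max)
    mu_rho = lam_min_pos / (L + rho * lam_min_pos)
    eta = 1.0 / L_rho
    beta = (np.sqrt(L_rho) - np.sqrt(mu_rho)) / (np.sqrt(L_rho) + np.sqrt(mu_rho))

    n = len(a)
    X, Lam, Omega = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(outer_iters):
        # Gradient of F(Z) + Omega^T Z + (rho / 2) * Z^T W Z: one local
        # gradient evaluation plus one multiplication by W (communication).
        grad = lambda Z, Om=Omega: C @ (Z - a) + Om + rho * (W @ Z)
        # Warm start at the previous primal iterate, fixed inner budget.
        X = agd(grad, X, L + rho * lam_max, mu, budget)
        Lam_next = Omega + eta * (W @ X)
        Omega = Lam_next + beta * (Lam_next - Lam)
        Lam = Lam_next
    return X

n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
W = np.diag(A.sum(axis=1)) - A                 # cycle-graph Laplacian
c = np.linspace(1.0, 10.0, n)
a = np.arange(n, dtype=float)
X = inexact_accelerated_al(np.diag(c), a, W, rho=1.0, outer_iters=150, budget=100)
x_star = (c @ a) / c.sum()
print(np.max(np.abs(X - x_star)))
```

No sub-optimality check appears anywhere in the loop: the inner solver simply stops when the budget is exhausted, which is the point of the fixed-budget option.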

Finally, the global complexity is obtained by summing the inner loop complexities over the $K$ subproblems, where $K$ is the number of subproblems given in (5). Note that, so far, our analysis applies to any regularization parameter $\rho$. Since the complexity is a function of $\rho$, this implies that one can select the parameter $\rho$ such that the overall complexity is minimized, leading to the choices of $\rho$ described in Table 1.

### Two-fold acceleration

In our setting, acceleration seems to occur in two stages (when compared to the non-accelerated rates in Shi et al. (2014); Jakovetić et al. (2014a); Shi et al. (2015); Qu and Li (2017); Nedic et al. (2017)). First, combining IDEAL with GD improves the dependence on the condition number of the mixing matrix $\kappa_W$. Secondly, when used as an inner solver, AGD improves the dependence on the condition number of the local functions $\kappa_f$. This suggests that the two phenomena are independent; while one is related to the consensus between the agents, as governed by the mixing matrix, the other one is related to the respective centralized hardness of the optimization problem.

### Stochastic oracle

Our framework also subsumes the stochastic setting, where only noisy gradients are available. In this case, since SGD is sublinear, the required iteration counters for the subproblem must increase inversely proportional to . Also the stepsize at the -th iteration needs to be decreased accordingly. The overall complexity is now given by . However, in this case, the resulting dependence on the graph condition number can be improved Fallah et al. (2019).

### Multi-stage variant (MIDEAL)

We remark that the complexity presented in Table 2 is abbreviated, in the sense that it does not distinguish between communication cost and computation cost. To provide a more fine-grained analysis, it suffices to note that performing a gradient step of the subproblem requires one local computation to evaluate $\nabla F$, and one round of communication to obtain $\mathbf{W}X$. This implies that when GD/AGD/SGD is combined with IDEAL, the number of local computation rounds is roughly the number of communication rounds, leading to a sub-optimal computation cost, as shown in Table 2.

To achieve optimal accelerated rates, we enforce multiple communication rounds after one evaluation of $\nabla F$. This is achieved by substituting the regularization metric $\|\cdot\|^2_{\mathbf{W}}$ with $\|\cdot\|^2_{Q(\mathbf{W})}$, where $Q$ is a well-chosen polynomial. In this case, the gradient of the subproblem involves $Q(\mathbf{W})X$, which requires $\deg(Q)$ rounds of communication.

The choice of the polynomial relies on Chebyshev acceleration, which is introduced in Scaman et al. (2017); Auzinger and Melenk (2011). More concretely, the Chebyshev polynomials are defined by the recursion relation $T_0(x) = 1$, $T_1(x) = x$, $T_{j+1}(x) = 2xT_j(x) - T_{j-1}(x)$, and $Q$ is defined by

$$Q(x) = 1 - \frac{T_{j_W}(c(1 - x))}{T_{j_W}(c)} \quad \text{with} \quad j_W = \lfloor\sqrt{\kappa_W}\rfloor, \quad c = \frac{\kappa_W + 1}{\kappa_W - 1}. \tag{7}$$

Applying this specific choice of $Q$ to the mixing matrix $W$ reduces its condition number by the maximum amount Scaman et al. (2017); Auzinger and Melenk (2011), yielding a graph independent bound on the condition number of $Q(W)$. Moreover, the symmetry, positiveness and spectrum property in Assumption 1 are maintained by $Q(W)$. Even though $Q(W)$ no longer satisfies the decentralized property, it can be implemented using $j_W$ rounds of communication with respect to $W$. The implementation details of the resulting algorithm are similar to Algorithm 2, and follow by substituting the mixing matrix $W$ by $Q(W)$ (Algorithm 5 in Appendix E).
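The construction in (7) can be sketched with the three-term Chebyshev recursion. The sketch below assumes the mixing matrix is first rescaled so that its positive eigenvalues lie in $[1 - 1/c, 1 + 1/c]$, a normalization in the spirit of Scaman et al. (2017) that is an assumption of this example rather than something stated in the text above. It also assumes a condition number strictly greater than 1, and the name `chebyshev_mixing` is illustrative.

```python
import numpy as np

def chebyshev_mixing(W, tol=1e-9):
    """Q(W) from (7), applied to a rescaled copy of W (assumed normalization)."""
    eig = np.linalg.eigvalsh(W)
    lam_max, lam_min_pos = eig[-1], eig[eig > tol][0]
    kappa = lam_max / lam_min_pos
    j_W = int(np.floor(np.sqrt(kappa)))
    c = (kappa + 1.0) / (kappa - 1.0)

    n = W.shape[0]
    I = np.eye(n)
    # Positive eigenvalues of the rescaled W lie in [1 - 1/c, 1 + 1/c],
    # so the eigenvalues of M on that subspace lie in [-1, 1].
    M = c * (I - 2.0 * W / (lam_max + lam_min_pos))
    # Chebyshev recursion on the matrix M and on the scalar c in lockstep;
    # each product with M costs one multiplication by W (one communication).
    T_prev, T_curr = I, M
    t_prev, t_curr = 1.0, c
    for _ in range(j_W - 1):
        T_prev, T_curr = T_curr, 2.0 * M @ T_curr - T_prev
        t_prev, t_curr = t_curr, 2.0 * c * t_curr - t_prev
    return I - T_curr / t_curr, j_W

def condition_number(W, tol=1e-9):
    eig = np.linalg.eigvalsh(W)
    pos = eig[eig > tol]
    return pos[-1] / pos[0]

n = 20
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
W = np.diag(A.sum(axis=1)) - A                 # cycle-graph Laplacian
Q, j_W = chebyshev_mixing(W)
kappa_W, kappa_Q = condition_number(W), condition_number(Q)
print(j_W, kappa_W, kappa_Q)
```

On this 20-node cycle the polynomial has degree 6 and the condition number drops from about 41 to below 2, while the all-ones vector stays in the kernel.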

### Comparison with inexact SSDA/MSDA Scaman et al. (2017)

Recall that SSDA/MSDA are special cases of our algorithmic framework with the degenerate regularization parameter $\rho = 0$. Therefore, our complexity analysis naturally extends to an inexact analysis of SSDA/MSDA, as shown in Table 2. Although the resulting communication costs are optimal, the computation cost is not, due to the additional factor introduced by solving the subproblems inexactly. In contrast, our multi-stage framework achieves the optimal computation cost.

• Low communication cost regime: , the computation cost dominates the communication cost, a improvement is obtained by MIDEAL comparing to MSDA.

• Ill conditioned regime: , the complexity of MSDA is dominated by the computation cost while the complexity MIDEAL is dominated by the communication cost . The improvement is proportional to the ratio .

• High communication cost regime: , the communication cost dominates, and MIDEAL and MSDA are comparable.

## 5 Experiments

Having described the IDEAL/MIDEAL algorithms for decentralized optimization problem (1), we now turn to presenting various empirical results which corroborate our theoretical analysis. To facilitate a simple comparison between existing state-of-the-art algorithms, we consider an $\ell_2$-regularized logistic regression task over two classes of the MNIST LeCun et al. (2010) benchmark dataset. The smoothness parameter (assuming normalized feature vectors) can be shown to be bounded by $1/4$, which together with the regularization parameter yields a relatively high bound on the condition number of the loss function. Further empirical results which demonstrate the robustness of IDEAL/MIDEAL under a wide range of parameter choices are provided in Appendix G.
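The smoothness bound can be checked on a small example. The sketch below implements a regularized logistic loss with its gradient and Hessian, and evaluates the Hessian spectrum on synthetic unit-norm features. The data are random stand-ins rather than MNIST, the squared-norm regularizer with weight `mu = 1e-3` is a hypothetical choice, and all function names are illustrative.

```python
import numpy as np

def logistic_loss(w, X, y, mu):
    """Average logistic loss plus (mu / 2) * ||w||^2; labels y in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * (w @ w)

def logistic_grad(w, X, y, mu):
    margins = y * (X @ w)
    sigma = np.exp(-np.logaddexp(0.0, margins))    # 1 / (1 + exp(margin))
    return -(X.T @ (y * sigma)) / len(y) + mu * w

def logistic_hessian(w, X, y, mu):
    margins = y * (X @ w)
    s = np.exp(-np.logaddexp(0.0, margins))
    weights = s * (1.0 - s)                        # each entry is at most 1/4
    return (X.T * weights) @ X / len(y) + mu * np.eye(X.shape[1])

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # normalized feature vectors
y = rng.choice([-1.0, 1.0], size=50)
mu = 1e-3                                          # hypothetical regularization weight
w = rng.standard_normal(5)

eigs = np.linalg.eigvalsh(logistic_hessian(w, X, y, mu))
print("largest Hessian eigenvalue:", eigs[-1], "<=", 0.25 + mu)
print("bound on the condition number:", (0.25 + mu) / mu)
```

Because each feature vector has unit norm and the logistic curvature is at most $1/4$, every Hessian eigenvalue lies between `mu` and `0.25 + mu`, which gives the condition number bound printed at the end.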

We compare the performance of IDEAL/MIDEAL with the state-of-the-art algorithms EXTRA Shi et al. (2015), APM-C Li et al. (2018) and the inexact dual method SSDA/MSDA Scaman et al. (2017). We use the same inner iteration counter for all algorithms, and use the theoretical stepsize schedule. The decentralized environment is modelled in a synthetic setting, where the communication time is steady and no latency is encountered. To demonstrate the effect of the underlying network architecture, we consider: a) a circular graph, where the agents form a cycle; b) a Barbell graph, where the agents are split into two complete subgraphs, connected by a single bridge (shown in Figure 2 in the appendix).

As shown in Figure 1, our multi-stage algorithm MIDEAL is optimal in the regime where the communication cost $\tau$ is small, and the single-stage variant IDEAL is optimal when $\tau$ is large. As expected, the inexactness mechanism significantly slows down the dual method SSDA/MSDA in the low communication cost regime. In contrast, the APM-C algorithm performs reasonably well in the low communication regime, but performs relatively poorly when the communication cost is high.

## 6 Conclusions

We propose a novel framework of decentralized algorithms for smooth and strongly convex objectives. The framework provides a unified viewpoint of several well-known decentralized algorithms and, when instantiated with AGD, achieves optimal convergence rates in theory and state-of-the-art performance in practice. We leave further generalization to (non-strongly) convex and non-smooth objectives to future work.

## Acknowledgements

YA and JB acknowledge support from the Sloan Foundation and Samsung Research. BC and MG acknowledge support from the grants NSF DMS-1723085 and NSF CCF-1814888. HL and SJ acknowledge support by The Defense Advanced Research Projects Agency (grant number YFA17 N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the author and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

## Broader Impact

Centralization of data is not always possible because of security and legacy concerns GDPR (2016). Our work proposes a new optimization algorithm in the decentralized setting, which can learn a model without revealing privacy-sensitive data. Potential applications include data coming from healthcare, environment, safety, etc., such as personal medical information Jochems et al. (2016, 2017), keyboard input history McMahan et al. (2016); Konečný et al. (2016) and beyond.

## References

• Y. Arjevani, O. Shamir, and N. Srebro (2020) A tight convergence analysis for stochastic gradient descent with delayed updates. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, Vol. 117, pp. 111–132. Cited by: §1.
• Y. Arjevani and O. Shamir (2015) Communication complexity of distributed convex learning and optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: §1.
• Y. Arjevani and O. Shamir (2016) On the iteration complexity of oblivious first-order optimization algorithms. In International Conferences on Machine Learning (ICML), Cited by: §2.
• W. Auzinger and J.M. Melenk (2011) Iterative solution of large linear systems. Lecture Note. Cited by: §4.
• N. S. Aybat and M. Gürbüzbalaban (2017) Decentralized computation of effective resistances and acceleration of consensus algorithms. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 538–542. Cited by: §2.
• D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein (2002) The complexity of decentralized control of markov decision processes. Mathematics of operations research 27 (4), pp. 819–840. Cited by: §1.
• D. P. Bertsekas (2014) Constrained optimization and lagrange multiplier methods. Academic press. Cited by: §4.
• B. Can, S. Soori, N. S. Aybat, M. M. Dehvani, and M. Gürbüzbalaban (2019) Decentralized computation of effective resistances and acceleration of distributed optimization algorithms. arXiv preprint arXiv:1907.13110. Cited by: Appendix A, §2.
• J. C. Duchi, A. Agarwal, and M. J. Wainwright (2011) Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic control 57 (3), pp. 592–606. Cited by: §1.
• D. Dvinskikh and A. Gasnikov (2019) Decentralized and parallelized primal and dual accelerated methods for stochastic convex programming problems. arXiv preprint arXiv:1904.09015. Cited by: §1.
• A. Fallah, M. Gürbüzbalaban, A. Ozdaglar, U. Simsekli, and L. Zhu (2019) Robust distributed accelerated stochastic gradient methods for multi-agent networks. arXiv preprint arXiv:1910.08701. Cited by: §1, §4.
• GDPR (2016) The eu general data protection regulation (gdpr). Cited by: Broader impact.
• O. Güler (1992) New proximal point algorithms for convex minimization. SIAM Journal on Optimization 2 (4), pp. 649–664. Cited by: §4.
• H. Hendrikx, F. Bach, and L. Massoulie (2020) An optimal algorithm for decentralized finite sum optimization. External Links: 2005.10675 Cited by: §1.
• D. Jakovetić, J. M. Moura, and J. Xavier (2014a) Linear convergence rate of a class of distributed augmented lagrangian algorithms. IEEE Transactions on Automatic Control 60 (4), pp. 922–936. Cited by: Appendix A, 1st item, §1, §3, §4, §4.
• D. Jakovetić, J. Xavier, and J. M. Moura (2014b) Fast distributed gradient methods. IEEE Transactions on Automatic Control 59 (5), pp. 1131–1146. Cited by: §1.
• A. Jochems, T. M. Deist, I. El Naqa, M. Kessler, C. Mayo, J. Reeves, S. Jolly, M. Matuszak, R. Ten Haken, J. van Soest, et al. (2017) Developing and validating a survival prediction model for nsclc patients through distributed learning across 3 countries. International Journal of Radiation Oncology* Biology* Physics 99 (2), pp. 344–352. Cited by: Broader impact.
• A. Jochems, T. M. Deist, J. Van Soest, M. Eble, P. Bulens, P. Coucke, W. Dries, P. Lambin, and A. Dekker (2016) Distributed learning: developing a predictive model based on data from multiple hospitals without data leaving the hospital–a real life proof of concept. Radiotherapy and Oncology 121 (3), pp. 459–467. Cited by: Broader impact.
• M. Kang, M. Kang, and M. Jung (2015) Inexact accelerated augmented lagrangian methods. Computational Optimization and Applications 62 (2), pp. 373–404. Cited by: §4.
• J. Konečný, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning, External Links: Link Cited by: Broader impact.
• Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online] 2. External Links: Link Cited by: §5.
• H. Li, C. Fang, W. Yin, and Z. Lin (2018) A sharp convergence rate analysis for distributed accelerated gradient methods. arXiv preprint arXiv:1810.01053. Cited by: §1, §3, §5.
• H. Lin, J. Mairal, and Z. Harchaoui (2017) Catalyst acceleration for first-order convex optimization: from theory to practice. The Journal of Machine Learning Research 18 (1), pp. 7854–7907. Cited by: §4.
• J. Mairal (2016) End-to-end kernel learning with supervised convolutional kernel networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: Figure 3.
• Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief (2017) A survey on mobile edge computing: the communication perspective. IEEE Communications Surveys & Tutorials 19 (4), pp. 2322–2358. Cited by: §1.
• B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §1.
• H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629. Cited by: Broader impact.
• V. Nedelcu, I. Necoara, and Q. Tran-Dinh (2014) Computational complexity of inexact gradient augmented lagrangian methods: application to constrained mpc. SIAM Journal on Control and Optimization 52 (5), pp. 3109–3134. Cited by: §4.
• A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe (2017) Geometrically convergent distributed optimization with uncoordinated step-sizes. In 2017 American Control Conference (ACC), pp. 3950–3955. Cited by: Appendix A, §1, §3.
• A. Nedic, A. Olshevsky, and W. Shi (2017) Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization 27 (4), pp. 2597–2633. Cited by: Appendix A, §1, §3, §4.
• A. Nedic and A. Ozdaglar (2009) Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54 (1), pp. 48–61. Cited by: §1, §2, §3.
• Y. Nesterov (2004) Introductory lectures on convex optimization. Vol. 87, Springer Science & Business Media. Cited by: §2.
• L. Panait and S. Luke (2005) Cooperative multi-agent learning: the state of the art. Autonomous agents and multi-agent systems 11 (3), pp. 387–434. Cited by: §1.
• G. Qu and N. Li (2017) Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems 5 (3), pp. 1245–1260. Cited by: Appendix A, §1, §3, §4.
• R. T. Rockafellar and R. J. Wets (2009) Variational analysis. Vol. 317, Springer Science & Business Media. Cited by: footnote 2.
• R. T. Rockafellar (1976) Augmented lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of operations research 1 (2), pp. 97–116. Cited by: §4.
• K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié (2017) Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conferences on Machine Learning (ICML), Cited by: Appendix A, Appendix E, Figure 4, IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method, 1st item, §1, §1, §2, §2, §3, §4, §4, Table 2, §4, §5, Theorem 1, Algorithm 6.
• K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee (2018) Optimal algorithms for non-smooth distributed optimization in networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: §1.
• M. Schmidt, N. L. Roux, and F. R. Bach (2011) Convergence rates of inexact proximal-gradient methods for convex optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: Appendix C.
• W. Shi, Q. Ling, G. Wu, and W. Yin (2015) Extra: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization 25 (2), pp. 944–966. Cited by: Appendix A, Appendix B, Appendix B, Figure 5, IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method, 1st item, §1, §2, §3, §4, §5.
• W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin (2014) On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing 62 (7), pp. 1750–1761. Cited by: Appendix A, 1st item, §1, §3, §4, §4.
• W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu (2016) Edge computing: vision and challenges. IEEE internet of things journal 3 (5), pp. 637–646. Cited by: §1.
• R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pp. 1310–1321. Cited by: §1.
• Y. Sun, A. Daneshmand, and G. Scutari (2019) Convergence rate of distributed optimization algorithms based on gradient tracking. arXiv preprint arXiv:1905.02637. Cited by: §3.
• C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić (2020) A dual approach for optimal algorithms in distributed optimization over networks. Optimization Methods and Software, pp. 1–40. Cited by: §1, §3.
• B. E. Woodworth, J. Wang, A. Smith, B. McMahan, and N. Srebro (2018) Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Cited by: §1.
• L. Xiao, S. Boyd, and S. Kim (2007) Distributed average consensus with least-mean-square deviation. Journal of parallel and distributed computing 67 (1), pp. 33–46. Cited by: §1.
• J. Xu, Y. Tian, Y. Sun, and G. Scutari (2020) Accelerated primal-dual algorithms for distributed smooth convex optimization over networks. International Conference on Artificial Intelligence and Statistics (AISTATS). Cited by: §1.
• S. Yan and N. He (2020) Bregman augmented lagrangian and its acceleration. External Links: 2002.06315 Cited by: §4.
• K. Yuan, Q. Ling, and W. Yin (2016) On the convergence of decentralized gradient descent. SIAM Journal on Optimization 26 (3), pp. 1835–1854. Cited by: §1, §2, §3.
• J. Zhang, C. A. Uribe, A. Mokhtari, and A. Jadbabaie (2019) Achieving acceleration in distributed optimization via direct discretization of the heavy-ball ode. In 2019 American Control Conference (ACC), pp. 3408–3413. Cited by: §1.

## Appendix A Remark on the choice of the mixing matrix

In the main paper, the mixing matrix $W$ is defined following the convention used in Scaman et al. (2017), where the kernel of $W$ is spanned by the vector of all ones. It is worth noting that the term mixing matrix is also used in the literature to denote a doubly stochastic matrix $W_{DS}$ (see e.g. Shi et al. (2014); Jakovetić et al. (2014a); Shi et al. (2015); Qu and Li (2017); Nedic et al. (2017); Nedić et al. (2017); Can et al. (2019)). These two conventions are equivalent: given a doubly stochastic matrix $W_{DS}$, the matrix $I - W_{DS}$ is a mixing matrix under the definition used in the main paper.

In the following discussion, we will use $W_{DS}$ to draw the connection when necessary.

## Appendix B Recovering EXTRA under the augmented Lagrangian framework

The goal of this section is to show that the EXTRA algorithm of Shi et al. (2015) is a special case of the non-accelerated augmented Lagrangian framework in Algorithm 1.

###### Proposition 5.

The EXTRA algorithm is equivalent to applying one step of gradient descent to solve the subproblem in Algorithm 1.

###### Proof.

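Taking a single step of gradient descent on the subproblem in Algorithm 1, warm-started at $X_{k-1}$, yields the updates

$$X_k = X_{k-1} - \alpha\big(\nabla F(X_{k-1}) + \Lambda_k + \rho W X_{k-1}\big), \tag{8}$$

$$\Lambda_{k+1} = \Lambda_k + \eta W X_k.$$

Writing the $(k+1)$-th update,

$$X_{k+1} = X_k - \alpha\big(\nabla F(X_k) + \Lambda_{k+1} + \rho W X_k\big), \tag{9}$$

and subtracting (8) from (9), with $\Lambda_{k+1} - \Lambda_k = \eta W X_k$, gives

$$X_{k+1} = \big(2I - \alpha(\rho + \eta)W\big)X_k - (I - \alpha\rho W)X_{k-1} - \alpha\big(\nabla F(X_k) - \nabla F(X_{k-1})\big).$$

Substituting the mixing matrix $W = I - W_{DS}$ and taking $\rho = \eta = \frac{1}{2\alpha}$ gives

$$X_{k+1} = (I + W_{DS})X_k - \frac{I + W_{DS}}{2}X_{k-1} - \alpha\big(\nabla F(X_k) - \nabla F(X_{k-1})\big),$$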

which is the update rule of EXTRA Shi et al. (2015). ∎
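The equivalence can also be checked numerically. The following is a minimal sketch, not the authors' code: the ring graph, the scalar quadratics $f_i(x) = \frac{1}{2}a_i(x - b_i)^2$, the stepsize and the zero initialization of the dual variable are all assumptions made for illustration. It runs the augmented Lagrangian loop with one gradient step per subproblem, using $W = I - W_{DS}$ and $\rho = \eta = \frac{1}{2\alpha}$, and compares the iterates with the EXTRA recursion seeded with the same first two points.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): n = 4 nodes on a ring,
# scalar quadratics f_i(x) = 0.5 * a_i * (x - b_i)^2.
a = np.array([1.0, 2.0, 1.5, 1.2])
b = np.array([0.3, -1.0, 2.0, 0.5])

def grad_F(X):
    """Stacked local gradients: the i-th entry is f_i'(x_i)."""
    return a * (X - b)

# Symmetric doubly stochastic matrix W_DS for the ring, and W = I - W_DS.
W_ds = np.array([[0.50, 0.25, 0.00, 0.25],
                 [0.25, 0.50, 0.25, 0.00],
                 [0.00, 0.25, 0.50, 0.25],
                 [0.25, 0.00, 0.25, 0.50]])
I = np.eye(4)
W = I - W_ds

alpha = 0.1                      # inner (gradient) stepsize, assumed
rho = eta = 1.0 / (2 * alpha)    # parameter choice that should recover EXTRA
num_iters = 50

# Augmented Lagrangian loop: one warm-started gradient step per subproblem.
X = np.array([1.0, -2.0, 0.5, 3.0])
Lam = np.zeros(4)                # assumed dual initialization
al_iterates = [X.copy()]
for _ in range(num_iters):
    X = X - alpha * (grad_F(X) + Lam + rho * (W @ X))   # primal step, as in (8)
    Lam = Lam + eta * (W @ X)                           # dual step
    al_iterates.append(X.copy())

# EXTRA recursion, seeded with the first two iterates of the loop above.
extra_iterates = [al_iterates[0], al_iterates[1]]
for _ in range(num_iters - 1):
    X_prev, X_cur = extra_iterates[-2], extra_iterates[-1]
    X_next = ((I + W_ds) @ X_cur
              - 0.5 * (I + W_ds) @ X_prev
              - alpha * (grad_F(X_cur) - grad_F(X_prev)))
    extra_iterates.append(X_next)

max_gap = max(np.max(np.abs(x - y)) for x, y in zip(al_iterates, extra_iterates))
print(f"largest deviation between the two trajectories: {max_gap:.2e}")
```

The two trajectories should agree up to floating-point rounding, since the recursion is obtained from the loop by pure algebra.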

###### Remark 6.

When expressing the parameters in terms of $\rho$, the inner-loop stepsize reads $\alpha = \frac{1}{2\rho}$, and the outer-loop stepsize reads $\eta = \rho$.

## Appendix C Proof of Theorem 3

We start by noting that Algorithm 2 is equivalent to its "unscaled" version, Algorithm 4. More specifically, we recover Algorithm 2 by substituting the variables

$$\Lambda \leftarrow \sqrt{W}\Lambda, \qquad \Omega \leftarrow \sqrt{W}\Omega.$$

The unscaled version is computationally inefficient since it requires computing the square root of $W$. This is why we present the scaled version, Algorithm 2, in the main paper. However, the unscaled version is easier to work with in the analysis. In the following proof, the variables $\Lambda$ and $\Omega$ refer to those of the unscaled version, Algorithm 4.

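The key concept underlying our analysis is the Moreau envelope of the dual problem:

$$\Phi_\rho(\Lambda) = \min_{\Gamma \in \mathbb{R}^{nd}} \Big\{ F^*(-\sqrt{W}\Gamma) + \frac{1}{2\rho}\|\Gamma - \Lambda\|^2 \Big\}. \tag{10}$$

Similarly, we define the associated proximal operator

$$\operatorname{prox}_{\Phi_\rho}(\Lambda) = \operatorname*{argmin}_{\Gamma \in \mathbb{R}^{nd}} \Big\{ F^*(-\sqrt{W}\Gamma) + \frac{1}{2\rho}\|\Gamma - \Lambda\|^2 \Big\}. \tag{11}$$

Note that when the inner problem is strongly convex, the minimizer is unique, so the proximal operator is single-valued. The following is a list of well-known properties of the Moreau envelope: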

###### Proposition 7.

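The Moreau envelope $\Phi_\rho$ enjoys the following properties:

1. $\Phi_\rho$ is convex and shares the same optimum as the dual problem (D).

2. $\Phi_\rho$ is differentiable and its gradient is given by

   $$\nabla\Phi_\rho(\Lambda) = \frac{1}{\rho}\big(\Lambda - \operatorname{prox}_{\Phi_\rho}(\Lambda)\big).$$

3. If $F$ is twice differentiable, then its convex conjugate $F^*$ is also twice differentiable. In this case, $\Phi_\rho$ is also twice differentiable and its Hessian is given by

   $$\nabla^2\Phi_\rho(\Lambda) = \frac{1}{\rho}I - \frac{1}{\rho^2}\Big[\frac{1}{\rho}I + \sqrt{W}\,\nabla^2F^*\big(-\sqrt{W}\operatorname{prox}_{\Phi_\rho}(\Lambda)\big)\sqrt{W}\Big]^{-1}.$$

###### Corollary 8.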

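The Moreau envelope $\Phi_\rho$ satisfies:

1. $\Phi_\rho$ is $L_\rho$-smooth, where $L_\rho = \frac{\lambda_{\max}(W)}{\mu + \rho\lambda_{\max}(W)}$.

2. $\Phi_\rho$ is $\mu_\rho$-strongly convex in the image space of $W$, where $\mu_\rho = \frac{\lambda^+_{\min}(W)}{L + \rho\lambda^+_{\min}(W)}$ and $\lambda^+_{\min}(W)$ denotes the smallest nonzero eigenvalue of $W$.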

###### Proof.

These properties follow from the expression for the Hessian of $\Phi_\rho$ and the fact that $F$ is $L$-smooth and $\mu$-strongly convex. ∎
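As a sanity check, the Hessian expression can be evaluated on a small example. The sketch below is illustrative only: it assumes a quadratic $F(X) = \frac{1}{2}X^TAX$ (so that $\nabla^2F^* = A^{-1}$ is constant), a ring-graph Laplacian as $W$, and the constants $L_\rho = \lambda_{\max}(W)/(\mu + \rho\lambda_{\max}(W))$ and $\mu_\rho = \lambda^+_{\min}(W)/(L + \rho\lambda^+_{\min}(W))$, which are what bounding the Hessian formula gives under these assumptions.

```python
import numpy as np

n, rho = 5, 0.7  # hypothetical problem size and penalty parameter

# Laplacian of a ring graph: symmetric PSD, kernel spanned by the all-ones vector.
W = 2 * np.eye(n) - np.roll(np.eye(n), 1, axis=0) - np.roll(np.eye(n), -1, axis=0)

# Quadratic F(X) = 0.5 * X^T A X with mu <= eig(A) <= L, so the Hessian of F^* is A^{-1}.
mu, L = 0.5, 4.0
A = np.diag(np.linspace(mu, L, n))

# Symmetric square root of W from its eigendecomposition (tiny eigenvalues set to zero).
eigvals, U = np.linalg.eigh(W)
positive = eigvals > 1e-9
root = np.where(positive, np.sqrt(np.abs(eigvals)), 0.0)
sqrtW = (U * root) @ U.T

# Hessian of the dual Moreau envelope (constant here because F is quadratic).
M = sqrtW @ np.linalg.inv(A) @ sqrtW
H = np.eye(n) / rho - np.linalg.inv(np.eye(n) / rho + M) / rho**2
H = (H + H.T) / 2  # symmetrize against rounding

# Constants assumed in this sketch.
lam_max = eigvals.max()
lam_min_pos = eigvals[positive].min()
L_rho = lam_max / (mu + rho * lam_max)
mu_rho = lam_min_pos / (L + rho * lam_min_pos)

# Largest curvature overall, smallest curvature restricted to Im(W),
# and curvature along the kernel direction (the all-ones vector).
U_im = U[:, positive]
top = np.linalg.eigvalsh(H).max()
bottom_on_image = np.linalg.eigvalsh(U_im.T @ H @ U_im).min()
kernel_curvature = np.ones(n) @ H @ np.ones(n) / n

print(f"top eigenvalue {top:.4f} vs L_rho {L_rho:.4f}")
print(f"bottom on Im(W) {bottom_on_image:.4f} vs mu_rho {mu_rho:.4f}")
print(f"curvature along the kernel: {kernel_curvature:.2e}")
```

The curvature along the all-ones direction vanishes, which illustrates why strong convexity can only hold on the image space of $W$.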

Since $\Phi_\rho$ is strongly convex only on the image space of $W$, one of the keys to proving the linear convergence rate is the following lemma.

###### Lemma 9.

The variables $\Lambda_k$ and $\Omega_k$ in the unscaled version, Algorithm 4, all lie in the image space of $W$ for any $k$.

###### Proof.

This follows by induction from the update rules in lines 4 and 5 of Algorithm 4. ∎

Similarly to the dual Moreau envelope, we also define the weighted Moreau envelope on the primal function

$$\Psi_\rho(\Omega) = \min_X \Big\{ F(X) + \Omega^TX + \frac{\rho}{2}\|X\|_W^2 \Big\} \tag{12}$$

and its associated proximal operator

$$\operatorname{prox}_{\Psi_\rho}(\Omega) = \operatorname*{argmin}_X \Big\{ F(X) + \Omega^TX + \frac{\rho}{2}\|X\|_W^2 \Big\}. \tag{13}$$

Indeed, this function corresponds exactly to the subproblem solved in the augmented Lagrangian framework (line 3 of Algorithm 2). Similar properties hold for $\Psi_\rho$:

###### Proposition 10.

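The Moreau envelope $\Psi_\rho$ enjoys the following properties:

1. $\Psi_\rho$ is concave.

2. $\Psi_\rho$ is differentiable and its gradient is given by

   $$\nabla\Psi_\rho(\Omega) = \operatorname{prox}_{\Psi_\rho}(\Omega).$$

3. If $F$ is twice differentiable, then $\Psi_\rho$ is also twice differentiable and its Hessian is given by

   $$\nabla^2\Psi_\rho(\Omega) = -\big[\nabla^2F(\operatorname{prox}_{\Psi_\rho}(\Omega)) + \rho W\big]^{-1}.$$

In particular, $\Psi_\rho$ is $\frac{1}{\mu}$-smooth and strongly concave.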

The dual Moreau-envelope and primal Moreau-envelope are connected through the following relationship.

###### Proposition 11.

The gradient of the Moreau envelope $\Phi_\rho$ is given by

$$\nabla\Phi_\rho(\Lambda) = -\sqrt{W}\,\nabla\Psi_\rho(\sqrt{W}\Lambda). \tag{14}$$

###### Proof.

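To simplify the presentation, let us denote

$$X(\Lambda) = \operatorname*{argmin}_X \Big\{ F(X) + (\sqrt{W}\Lambda)^TX + \frac{\rho}{2}\|X\|_W^2 \Big\} = \nabla\Psi_\rho(\sqrt{W}\Lambda).$$

From the optimality of $X(\Lambda)$, we have

$$\nabla F(X(\Lambda)) + \sqrt{W}\Lambda + \rho WX(\Lambda) = 0.$$

From the fact that $\nabla F^*(\nabla F(X)) = X$, we have

$$X(\Lambda) = \nabla F^*\big(-\sqrt{W}(\Lambda + \rho\sqrt{W}X(\Lambda))\big).$$

Let $\Gamma = \Lambda + \rho\sqrt{W}X(\Lambda)$, then

$$-\sqrt{W}\,\nabla F^*(-\sqrt{W}\Gamma) + \frac{1}{\rho}(\Gamma - \Lambda) = 0.$$

Therefore $\Gamma$ is the minimizer of the function $\Gamma \mapsto F^*(-\sqrt{W}\Gamma) + \frac{1}{2\rho}\|\Gamma - \Lambda\|^2$, namely

$$\operatorname{prox}_{\Phi_\rho}(\Lambda) = \Lambda + \rho\sqrt{W}X(\Lambda).$$

Then, based on the expression for the gradient in Proposition 7, we obtain the desired equality (14). ∎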
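Identity (14) can be verified numerically on a small instance. The following sketch is an illustration under stated assumptions, not part of the original analysis: $F(X) = \frac{1}{2}X^TAX - b^TX$ with a hypothetical diagonal $A$, a ring-graph Laplacian as $W$, and closed-form solutions of both proximal problems (available here only because $F$ is quadratic).

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 5, 0.7  # hypothetical problem size and penalty parameter

# Ring-graph Laplacian as mixing matrix, and its symmetric square root.
W = 2 * np.eye(n) - np.roll(np.eye(n), 1, axis=0) - np.roll(np.eye(n), -1, axis=0)
eigvals, U = np.linalg.eigh(W)
root = np.where(eigvals > 1e-9, np.sqrt(np.abs(eigvals)), 0.0)
sqrtW = (U * root) @ U.T

# Strongly convex quadratic F(X) = 0.5 X^T A X - b^T X, whose conjugate is
# F^*(Y) = 0.5 (Y + b)^T A^{-1} (Y + b).
A = np.diag(np.linspace(0.5, 4.0, n))
A_inv = np.linalg.inv(A)
b = rng.standard_normal(n)

def grad_Psi(Omega):
    """Primal prox: minimizer of F(X) + Omega^T X + (rho / 2) X^T W X."""
    return np.linalg.solve(A + rho * W, b - Omega)

def prox_Phi(Lam):
    """Dual prox: minimizer of F^*(-sqrtW Gamma) + ||Gamma - Lam||^2 / (2 rho)."""
    # For quadratic F the optimality condition is a linear system in Gamma.
    lhs = sqrtW @ A_inv @ sqrtW + np.eye(n) / rho
    rhs = Lam / rho + sqrtW @ A_inv @ b
    return np.linalg.solve(lhs, rhs)

Lam = rng.standard_normal(n)
grad_Phi_dual = (Lam - prox_Phi(Lam)) / rho          # gradient formula of the dual envelope
grad_Phi_primal = -sqrtW @ grad_Psi(sqrtW @ Lam)     # right-hand side of (14)

print("max abs difference:", np.max(np.abs(grad_Phi_dual - grad_Phi_primal)))
```

In practice only the primal route is usable, since it needs a solve of the augmented Lagrangian subproblem rather than access to $F^*$.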

Proposition 11 demonstrates that solving the augmented Lagrangian subproblem can be viewed as evaluating the gradient of the Moreau envelope. Hence applying gradient descent to the Moreau envelope gives the non-accelerated augmented Lagrangian framework, Algorithm 1. Moreover, applying Nesterov's accelerated gradient method to the Moreau envelope yields the accelerated augmented Lagrangian method, Algorithm 4. In addition, when the subproblems are solved inexactly, this corresponds to an inexact evaluation of the gradient. This interpretation allows us to derive guarantees for the convergence rate of the dual variables. Before presenting the convergence analysis in detail, we formally establish the connection between the primal solution and the dual solution.

###### Lemma 12.

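Let $x^*$ be the optimum of $f$ and define $X^* = \mathbf{1}_n \otimes x^*$. Then there exists a unique $\Lambda^* \in \operatorname{Im}(W)$ such that $\Lambda^*$ is the optimum of the dual problem (D). Moreover, it satisfies

$$\nabla F(X^*) = -\sqrt{W}\Lambda^*.$$

###### Proof.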

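Since the kernel of the mixing matrix is spanned by $\mathbf{1}_n$, we have

$$\operatorname{Ker}(W) = \operatorname{Ker}(W \otimes I_d) = \operatorname{Vect}(\mathbf{1}_n \otimes e_i,\ i = 1, \dots, d),$$

where $e_i$ is the $i$-th canonical basis vector, with all entries 0 except the $i$-th, which equals 1. By optimality, $\sum_{j=1}^n \nabla f_j(x^*) = 0$. This implies that $(\mathbf{1}_n \otimes e_i)^T\nabla F(X^*) = 0$ for all $i = 1, \dots, d$. In other words, $\nabla F(X^*)$ is orthogonal to the null space of $W$, namely $\nabla F(X^*) \in \operatorname{Im}(W)$. Therefore, there exists $\Lambda$ such that $\nabla F(X^*) = -W\Lambda$. By setting $\Lambda^* = \sqrt{W}\Lambda$, we have $\nabla F(X^*) = -\sqrt{W}\Lambda^*$ and $\Lambda^* \in \operatorname{Im}(W)$. In particular, since $X^* = \nabla F^*(\nabla F(X^*)) = \nabla F^*(-\sqrt{W}\Lambda^*)$, we have

$$\sqrt{W}\,\nabla F^*(-\sqrt{W}\Lambda^*) = \sqrt{W}X^* = 0. \tag{15}$$

Hence $\Lambda^*$ is a solution of the dual problem (D), and it is the unique one that lies in $\operatorname{Im}(W)$. ∎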
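The construction in this proof can be illustrated on a toy instance. The sketch below rests on assumptions that are not in the paper: scalar quadratics $f_i(x) = \frac{1}{2}a_i(x - b_i)^2$ on a ring of five nodes, and a dual point built with the pseudo-inverse of $\sqrt{W}$. It checks that $\nabla F(X^*)$ has no component in the kernel of $W$ and that the resulting $\Lambda^*$ satisfies the stated relations.

```python
import numpy as np

# Hypothetical local objectives f_i(x) = 0.5 * a_i * (x - b_i)^2 on n = 5 nodes (d = 1).
a = np.array([1.0, 2.0, 0.5, 3.0, 1.5])
b = np.array([0.2, -1.0, 4.0, 0.7, -2.5])
n = len(a)

# Ring-graph Laplacian, its square root and the pseudo-inverse of the square root.
W = 2 * np.eye(n) - np.roll(np.eye(n), 1, axis=0) - np.roll(np.eye(n), -1, axis=0)
eigvals, U = np.linalg.eigh(W)
positive = eigvals > 1e-9
root = np.where(positive, np.sqrt(np.abs(eigvals)), 0.0)
inv_root = np.zeros(n)
inv_root[positive] = 1.0 / root[positive]
sqrtW = (U * root) @ U.T
pinv_sqrtW = (U * inv_root) @ U.T

# Consensus optimum x* of sum_i f_i, stacked as X* = 1_n * x*.
x_star = np.sum(a * b) / np.sum(a)
X_star = np.full(n, x_star)
grad_at_opt = a * (X_star - b)                  # nabla F(X*)

# Candidate dual solution in Im(W): Lambda* = -pinv(sqrtW) nabla F(X*).
Lam_star = -pinv_sqrtW @ grad_at_opt

ones = np.ones(n)
orth_to_kernel = abs(ones @ grad_at_opt)                    # nabla F(X*) orthogonal to Ker(W)
residual = np.linalg.norm(grad_at_opt + sqrtW @ Lam_star)   # nabla F(X*) = -sqrtW Lambda*
kernel_part = abs(ones @ Lam_star)                          # Lambda* has no kernel component
stationarity = np.linalg.norm(sqrtW @ X_star)               # sqrtW X* = 0, as in (15)

print(orth_to_kernel, residual, kernel_part, stationarity)
```

Adding any multiple of the all-ones vector to `Lam_star` leaves `sqrtW @ Lam_star` unchanged, which is the non-uniqueness that the restriction to $\operatorname{Im}(W)$ removes.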

Throughout the rest of the paper, we use $\Lambda^*$ to denote the unique solution given by the lemma above. We would like to emphasize that even though $F^*$ is strongly convex, the dual problem (D) is not strongly convex, because $W$ is singular. Hence, the solution of the dual problem is not unique unless we restrict to the image space of $W$. To derive the linear convergence rate, we need to show that the dual variable always lies in this subspace, where the Moreau envelope is strongly convex.

###### Theorem 13.

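Consider the sequence of primal variables $(X_k)_k$ generated by Algorithm 3 with the subproblem solved up to accuracy $\epsilon_k$, i.e. Option I. Then,

$$\|X_{k+1} - X^*\|^2 \le 2\epsilon_{k+1} + C\left(1 - \sqrt{\frac{\mu_\rho}{L_\rho}}\right)^k\left(\sqrt{\mu_\rho\Delta_{dual}} + A_k\right)^2 \tag{16}$$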

where , , , , is the dual function gap defined by and

###### Proof.

The proof builds on the concepts developed so far in this section. We start by showing that the dual variable $\Lambda_k$ converges to the dual solution $\Lambda^*$ at a linear rate. From the interpretation given in Proposition 7 and Proposition 11, the sequence $(\Lambda_k)_k$ given in Algorithm 2 is equivalent to applying Nesterov's accelerated gradient method to the Moreau envelope $\Phi_\rho$. In the inexact variant, the inexactness of the solution directly translates to an inexact gradient of $\Phi_\rho$, where the inexactness is given by

$$\|e_k\| = \|\sqrt{W}(X_k - X_k^*)\| \le \sqrt{\lambda_{\max}(W)}\,\|X_k - X_k^*\| \le \sqrt{\lambda_{\max}(W)\,\epsilon_k}.$$

Hence $\Lambda_k$ in Algorithm 4 is obtained by applying the inexact accelerated gradient method to the Moreau envelope $\Phi_\rho$. Note that by induction $\Lambda_k$ and $\Omega_k$ belong to the image space of $W$, in which the dual Moreau envelope is strongly convex. Following the analysis of the inexact accelerated gradient method in Proposition 4 of Schmidt et al. (2011), we have

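$$\frac{\mu_\rho}{2}\|\Lambda_{k+1} - \Lambda^*\|^2 \le \left(1 - \sqrt{\frac{\mu_\rho}{L_\rho}}\right)^{k+1}\left(\sqrt{2\Delta_{\Phi_\rho}} + \sqrt{\frac{2}{\mu_\rho}}\,A_k\right)^2 \tag{17}$$

where $\Delta_{\Phi_\rho}$ denotes the initial function gap of $\Phi_\rho$ and $A_k$ is the accumulation of the errors given by

$$A_k = \sum_{i=1}^k \|e_i\|\left(1 - \sqrt{\frac{\mu_\rho}{L_\rho}}\right)^{-i/2} \le \sum_{i=1}^k \sqrt{\lambda_{\max}(W)\,\epsilon_i}\left(1 - \sqrt{\frac{\mu_\rho}{L_\rho}}\right)^{-i/2}.$$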

Based on the convergence of the dual variable, we can now derive the convergence of the primal variable. Let $X^*_{k+1}$ be the exact solution of the subproblem at iteration $k+1$. Then

 ∥X∗k+1−X∗∥ =∥∇Ψρ(√WΛk+1)−∇Ψρ(√WΛ∗)∥ ≤1μ