# The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where M machines work in parallel over the course of R rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute K stochastic gradient estimates. We present a novel lower bound with a matching upper bound that establishes an optimal algorithm.

## 1 Introduction

The min-max oracle complexity of stochastic convex optimization in a sequential (non-parallel) setting is very well-understood, and we have provably optimal algorithms that achieve the min-max complexity (Lan, 2012; Ghadimi and Lan, 2013). However, we do not yet have an understanding of the min-max complexity of stochastic optimization in a distributed setting, where oracle queries and computation are performed by different workers, with limited communication between them. Perhaps the simplest, most basic, and most important distributed setting is that of intermittent communication.

In the (homogeneous) intermittent communication setting, $M$ parallel workers are used to optimize a single objective over the course of $R$ rounds. During each round, each machine sequentially and locally computes $K$ independent unbiased stochastic gradients of the global objective, and then all the machines communicate with each other. This captures the natural setting where multiple parallel “workers” or “machines” are available, and computation on each worker is much faster than communication between workers. It includes applications ranging from optimization using multiple cores or GPUs, to using a cluster of servers, to Federated Learning, where workers are edge devices. (In a realistic Federated Learning setting, stochastic gradient estimates on the same machine might be correlated, or we might prefer thinking of a heterogeneous setting where each device has a different local objective. Nevertheless, much of the methodological and theoretical development in Federated Learning has been focused on the homogeneous intermittent communication setting we study here; see Kairouz et al., 2019, and citations therein.)

The intermittent communication setting has been widely studied for over a decade, with many optimization algorithms proposed and analyzed (Zinkevich et al., 2010; Cotter et al., 2011; Dekel et al., 2012; Zhang et al., 2013a, c; Shamir and Srebro, 2014), and obtaining new methods and improved analysis is still a very active area of research (Wang et al., 2017; Stich, 2018; Wang and Joshi, 2018; Khaled et al., 2019; Haddadpour et al., 2019; Woodworth et al., 2020b). However, despite these efforts, we do not yet know which methods are optimal, what the min-max complexity is, and what methodological or analytical improvements might allow us to make further progress.

Considerable effort has been made to formalize the setting and establish lower bounds for distributed optimization (Zhang et al., 2013b; Arjevani and Shamir, 2015; Braverman et al., 2016) and here, we follow the graph-oracle formalization of Woodworth et al. (2018). However, a key issue in the existing literature is that known lower bounds for the intermittent communication setting depend only on the product $KR$ (i.e. the total number of gradients computed on each machine over the course of optimization), and not on the number of rounds, $R$, and the number of gradients per round, $K$, separately.

Thus, existing results cannot rule out the possibility that the optimal rate for fixed $T = KR$ can be achieved using only a single round of communication ($R = 1$), since they do not distinguish between methods that communicate very frequently ($R = T$, $K = 1$) and methods that communicate just once ($R = 1$, $K = T$). The possibility that the optimal rate is achievable with $R = 1$ was suggested by Zhang et al. (2013c), and indeed Woodworth et al. (2020b) proved that an algorithm that communicates just once is optimal in the special case of quadratic objectives. While it seems unlikely that a single round of communication suffices in the general case, none of our existing lower bounds are able to answer this extremely basic question.

In this paper, we resolve (up to a logarithmic factor) the minimax complexity of smooth, convex stochastic optimization in the (homogeneous) intermittent communication setting and we show that, generally speaking, a single round of communication does not suffice to achieve the min-max optimal rate. Our main result in Section 3 is a lower bound on the optimal rate of convergence and a matching upper bound. Interestingly, we show that the combination of two extremely simple and naïve methods based on an accelerated stochastic gradient descent (SGD) variant called AC-SA (Lan, 2012) is optimal up to a logarithmic factor. Specifically, we show that the better of the following methods is optimal: “Minibatch Accelerated SGD”, which executes $R$ steps of AC-SA using minibatch gradients of size $MK$, and “Single-Machine Accelerated SGD”, which executes $KR$ steps of AC-SA on just one of the machines, completely ignoring the other $M-1$.

These methods might seem to be horribly inefficient: Minibatch Accelerated SGD only performs one update per round of communication, and Single-Machine Accelerated SGD only uses one of the available workers! This perceived inefficiency has prompted many attempts at developing improved methods which take multiple steps on each machine locally in parallel including, in particular, numerous analyses of Local SGD (Zinkevich et al., 2010; Dekel et al., 2012; Stich, 2018; Haddadpour et al., 2019; Khaled et al., 2019; Woodworth et al., 2020b). Nevertheless, we establish that one or the other is optimal in every regime, so more sophisticated methods cannot yield improved guarantees for arbitrary smooth objectives. Our results therefore highlight an apparent dichotomy between exploiting the available parallelism but not the local computation (Minibatch Accelerated SGD) and exploiting the local computation but not the parallelism (Single-Machine Accelerated SGD).

Our lower bound applies quite broadly, including to the homogeneous setting considered by much of the existing work on stochastic first-order optimization in the intermittent communication setting. But, like many lower bounds, we should not interpret this to mean we cannot make progress. Rather, it indicates that we need to expand our model or modify our assumptions in order to develop better methods. In Section 5 we explore several additional assumptions that allow for circumventing our lower bound. These include when the third derivative of the objective is bounded (as in recent work by Yuan and Ma (2020)), when the objective has a certain statistical learning-like structure, or when the algorithm has access to a more powerful oracle.

Our work on the homogeneous setting, where each machine has stochastic gradients from the same distribution, complements prior work in the heterogeneous regime, where each machine has stochastic gradients from a different distribution. In the heterogeneous setting, prior work (Arjevani and Shamir, 2015) has already established lower bounds that both distinguish between $K$ and $R$ and, more strongly, show that the min-max rate is generally dominated by a term that does not improve with $K$, meaning that local computation is of limited utility in the heterogeneous case. However, these results depend very strongly on the heterogeneity of the problem, and do not apply in our setting. In particular, in the homogeneous setting, it is always possible to achieve arbitrarily small error without any communication as long as $K$ is large enough (e.g. by performing SGD on a single one of the machines). Our lower bound is therefore the first lower bound that distinguishes $K$ from $R$ independently of the level of heterogeneity; indeed, it does so even in the case that the problem is homogeneous.

## 2 Setting and Notation

We aim to understand the fundamental limits of stochastic first-order algorithms in the intermittent communication setting. Accordingly, we consider a standard smooth, convex problem

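$$\min_x F(x) \tag{1}$$

where $F$ is convex, $\|x^*\| \le B$ for a minimizer $x^*$, and $F$ is $H$-smooth, so for all $x, y$

$$F(x) + \langle \nabla F(x), y - x \rangle \le F(y) \le F(x) + \langle \nabla F(x), y - x \rangle + \frac{H}{2}\|y - x\|^2 \tag{2}$$

We consider algorithms that gain information about the objective via a stochastic gradient oracle $g$ with bounded variance, which satisfies for all $x$

$$\mathbb{E}\, g(x) = \nabla F(x) \quad \text{and} \quad \mathbb{E}\|g(x) - \nabla F(x)\|^2 \le \sigma^2 \tag{3}$$

(The bounded-variance assumption can be strong, and does not hold for natural problems like least squares regression (Nguyen et al., 2019); nevertheless, this strengthens rather than weakens our lower bound.)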

This is a well-studied class of optimization objectives: smooth, bounded, convex objectives with a bounded-variance stochastic gradient oracle.

Understanding optimal methods for this class of problems requires specifying a class of optimization algorithms. We consider intermittent communication algorithms, which attempt to optimize $F$ using $M$ parallel workers, each of which is allowed $K$ queries to $g$ in each of $R$ rounds of communication. Such intermittent communication algorithms can be formalized using the graph oracle framework of Woodworth et al. (2018), which focuses on the dependence structure between different stochastic gradient computations. To facilitate our lower bounds, we follow Carmon et al. (2017) and focus our attention on distributed zero-respecting algorithms:

###### Definition 1 (Distributed Zero-Respecting Intermittent Communication Algorithm).

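We say that a parallel method is an intermittent communication algorithm if for each $m \in [M]$, $k \in [K]$, and $r \in [R]$, there exists a mapping $\mathcal{A}^m_{k,r}$ such that $x^m_{k,r}$, the $k$th query on the $m$th machine during the $r$th round of communication, is computed as

$$x^m_{k,r} = \mathcal{A}^m_{k,r}\left(\left[x^{m'}_{k',r'},\, g(x^{m'}_{k',r'})\right]_{m' \in [M],\, k' \in [K],\, r' < r},\ \left[x^{m}_{k',r},\, g(x^{m}_{k',r})\right]_{k' < k},\ \xi\right)$$

where $\xi$ is a string of random bits that the algorithm may use for randomization. In addition, for a vector $x$, we define $\operatorname{support}(x) := \{j : x_j \neq 0\}$, and we say that an intermittent communication algorithm is distributed zero-respecting if

$$\operatorname{support}(x^m_{k,r}) \subseteq \bigcup_{m' \in [M],\, k' \in [K],\, r' < r} \operatorname{support}\big(g(x^{m'}_{k',r'})\big) \ \cup \bigcup_{k' < k} \operatorname{support}\big(g(x^{m}_{k',r})\big)$$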

In other words, each oracle query made by an intermittent communication algorithm must be computed based only on information about the objective that is available to the querying machine at the time—that is, stochastic gradients computed on the same machine earlier in the current round of communication, or computed on any machine in a previous round of communication. This is not a restriction on the algorithm but rather a specification of the distributed setting we consider.

An intermittent communication algorithm is also zero-respecting when each oracle query only has non-zero coordinates where previously-seen gradients were non-zero. This does slightly reduce the scope of the algorithms we consider, but it is a very broad class of algorithms which naturally generalizes “linear-span algorithms” (Nesterov, 2004). A wide range of optimization algorithms are distributed zero-respecting, including Minibatch SGD, Local SGD, accelerated variants of SGD, coordinate descent methods, and all other first-order intermittent communication algorithms that we are aware of. An algorithm that is not zero-respecting would be essentially “guessing” by making queries that are unrelated to previously seen stochastic gradients. Furthermore, work in other contexts has succeeded in proving that the min-max complexity for arbitrary randomized algorithms often matches the min-max complexity for zero-respecting algorithms, but at the expense of much more complicated proofs (e.g. Woodworth and Srebro, 2016; Carmon et al., 2017; Arjevani et al., 2020).
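The zero-respecting condition can be made concrete with a small check. The following is only an illustrative sketch: the helper names `support` and `is_zero_respecting` are hypothetical, vectors are plain Python lists, and `visible_gradients` stands for whatever stochastic gradients the querying machine is allowed to see at that point.

```python
def support(v):
    """Indices of the non-zero coordinates of a vector given as a list."""
    return {j for j, vj in enumerate(v) if vj != 0.0}


def is_zero_respecting(query, visible_gradients):
    """True if `query` is non-zero only where some visible gradient was non-zero."""
    allowed = set()
    for g in visible_gradients:
        allowed |= support(g)
    return support(query) <= allowed


# Gradients seen so far touch only coordinates 0 and 1.
seen = [[0.5, 0.0, 0.0], [0.0, -1.0, 0.0]]

# A gradient step from the origin stays inside the seen coordinates.
sgd_query = [-0.1 * (a + b) for a, b in zip(*seen)]

# A query on an unseen coordinate amounts to "guessing".
guess_query = [0.0, 0.0, 1.0]
```

Here `sgd_query` passes the check while `guess_query` does not, which mirrors the distinction drawn above between span-based updates and queries unrelated to previously seen gradients.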

Finally, we are considering a “homogeneous” setting, where each of the machines has access to stochastic gradients from the same distribution, in contrast to the more challenging “heterogeneous” setting, where they come from different distributions, which could arise in a machine learning context when each machine uses data from a different source. The heterogeneous setting is interesting, important, and widely studied, but we focus here on the more basic question of min-max rates for homogeneous distributed optimization. We point out that our lower bounds also apply to heterogeneous objectives since homogeneous optimization is a special case of heterogeneous optimization, and there are also some lower bounds specific to the heterogeneous setting (e.g. Arjevani and Shamir, 2015) but they do not apply to our setting.

## 3 The Lower Bound

We now present our main result, which is a lower bound on what suboptimality can be guaranteed by any distributed zero-respecting intermittent communication algorithm in the worst case:

###### Theorem 1.

For any and any such that ,333We note that this restriction on is essentially without loss of generality since, by smoothness, and it is well-known that any algorithm that uses at most stochastic gradients will suffer suboptimality at least in the worst case (Nemirovsky and Yudin, 1983). there exists a convex, -smooth objective with and a stochastic gradient oracle with for all

such that with probability at least

, all of the oracle queries, , made by any distributed zero-respecting intermittent communication algorithm have suboptimality

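$$\min_{m,k,r} F(x^m_{k,r}) - F^* \ge c \cdot \left[\frac{HB^2}{K^2R^2} + \frac{\sigma B}{\sqrt{MKR}} + \min\left\{\frac{HB^2}{R^2\log^2 M},\ \frac{\sigma B}{\sqrt{KR}}\right\}\right]$$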

Proof Sketch. The first two terms of this lower bound follow directly from previous work (Woodworth et al., 2018); the $HB^2/(K^2R^2)$ term corresponds to optimizing a function with a deterministic gradient oracle, and the $\sigma B/\sqrt{MKR}$ term is a very well-known statistical limit (see, e.g., Nemirovsky and Yudin, 1983). The distinguishing feature of our lower bound is the $\min\{\cdot,\cdot\}$ term, which depends differently on $K$ than on $R$. For quadratics, the min-max complexity actually does depend only on the product $KR$, and is given by the first two terms only (Woodworth et al., 2020b). Consequently, proving our lower bound necessitates going beyond quadratics (in contrast, all the lower bounds for sequential smooth convex optimization that we are aware of can be obtained using quadratics). We therefore prove the Theorem using the following non-quadratic hard instance

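$$F(x) = \psi'(-\zeta)\,x_1 + \psi(x_N) + \sum_{i=1}^{N-1} \psi(x_{i+1} - x_i) \tag{4}$$

where $\psi$ is defined as

$$\psi(x) := \frac{\sqrt{H}\,x}{2\beta}\arctan\left(\frac{\sqrt{H}\beta x}{2}\right) - \frac{1}{2\beta^2}\log\left(1 + \frac{H\beta^2x^2}{4}\right) \tag{5}$$

and where $\beta$, $\zeta$, and $N$ are hyperparameters that are chosen depending on the problem parameters so that $F$ satisfies the necessary conditions. This construction closely resembles the classic lower bound for deterministic first-order optimization of Nesterov (2004), which essentially replaces $\psi$ with a quadratic. To describe our stochastic gradient oracle, we will use $\operatorname{prog}(x) := \max\{j : x_j \neq 0\}$, which denotes the highest index of a non-zero coordinate of $x$. We also define $F^-$ to be equal to the objective with the $\psi(x_{\operatorname{prog}(x)+1} - x_{\operatorname{prog}(x)})$ term removed:

$$F^-(x) = \psi'(-\zeta)\,x_1 + \psi(x_N) + \sum_{i=1}^{\operatorname{prog}(x)-1} \psi(x_{i+1} - x_i) + \sum_{i=\operatorname{prog}(x)+1}^{N-1} \psi(x_{i+1} - x_i) \tag{6}$$

The stochastic gradient oracle for $F$ is then given by

$$g(x) = \begin{cases} \nabla F^-(x) & \text{with probability } 1-p \\ \nabla F(x) + \frac{1-p}{p}\left(\nabla F(x) - \nabla F^-(x)\right) & \text{with probability } p \end{cases} \tag{7}$$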

This stochastic gradient oracle resembles the one used by Arjevani et al. (2019) to prove lower bounds for non-convex optimization, and its key property is that $\operatorname{prog}(g(x)) \le \operatorname{prog}(x)$ with probability $1-p$. Therefore, by the definition of a distributed zero-respecting algorithm, each oracle access only allows the algorithm to increase its progress with probability $p$. The rest of the proof revolves around bounding the total progress of the algorithm and showing that if $\operatorname{prog}(x)$ is too small, then $x$ has high suboptimality.

Since each machine makes $KR$ sequential queries and only makes progress with probability $p$, the total progress scales like $KRp$. By taking $p$ smaller, we decrease the amount of progress made by the algorithm, and therefore increase the lower bound. Indeed, when $p \approx 1/K$, the algorithm only increases its progress by about $\log M$ per round, which gives rise to the key $HB^2/(R^2\log^2 M)$ term in the lower bound. However, we are constrained in how small we can take $p$ since our stochastic gradient oracle has variance

$$\sup_x \mathbb{E}\|g(x) - \nabla F(x)\|^2 = \frac{2(1-p)}{p}\sup_x \psi'(x)^2 \tag{8}$$
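The unbiasedness of the oracle (7) and the variance expression (8) can be checked exactly on a small instance by enumerating the two branches of the oracle. The sketch below is illustrative only: the values $N = 5$, $H = \beta = \zeta = 1$, $p = 0.1$ and the test point are arbitrary assumptions, the helper names are hypothetical, and the closed form of $\psi'$ is the one stated in (9).

```python
import math


def psi_prime(x, H=1.0, beta=1.0):
    # Closed form of psi' as stated in (9).
    return math.sqrt(H) / (2 * beta) * math.atan(math.sqrt(H) * beta * x / 2)


def prog(x):
    # Highest 1-based index of a non-zero coordinate; 0 for the zero vector.
    return max((i + 1 for i, xi in enumerate(x) if xi != 0.0), default=0)


def grad_F(x, zeta, skip=None):
    """Gradient of F in (4); with skip=j (1-based) the term psi(x_{j+1} - x_j) is dropped."""
    n = len(x)
    g = [0.0] * n
    g[0] += psi_prime(-zeta)         # linear term psi'(-zeta) * x_1
    g[n - 1] += psi_prime(x[n - 1])  # psi(x_N)
    for i in range(1, n):            # chain terms psi(x_{i+1} - x_i), i = 1..N-1
        if i == skip:
            continue
        d = psi_prime(x[i] - x[i - 1])
        g[i] += d
        g[i - 1] -= d
    return g


def oracle_branches(x, zeta, p):
    """The two outcomes of g(x) in (7), as (probability, estimate) pairs."""
    full = grad_F(x, zeta)
    minus = grad_F(x, zeta, skip=prog(x))
    boosted = [f + (1 - p) / p * (f - m) for f, m in zip(full, minus)]
    return [(1 - p, minus), (p, boosted)]


x = [0.5, -0.25, 0.0, 0.0, 0.0]  # prog(x) = 2
p, zeta = 0.1, 1.0
branches = oracle_branches(x, zeta, p)
full = grad_F(x, zeta)

# Exact expectation and variance of g(x) over the two branches.
mean = [sum(q * g[i] for q, g in branches) for i in range(len(x))]
var = sum(q * sum((gi - fi) ** 2 for gi, fi in zip(g, full)) for q, g in branches)

# Pointwise version of (8): the removed term is psi(x_3 - x_2) since prog(x) = 2.
predicted = 2 * (1 - p) / p * psi_prime(x[2] - x[1]) ** 2
```

At this point, `mean` coincides with `full`, `var` coincides with `predicted`, and the low-probability branch is the only one whose estimate has a non-zero entry beyond coordinate `prog(x)`.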

This is where our choice of $\psi$ comes in. Specifically, we chose the function $\psi$ to be convex and smooth so that $F$ is, but we also made it Lipschitz:

$$\psi'(x) = \frac{\sqrt{H}}{2\beta}\arctan\left(\frac{\sqrt{H}\beta x}{2}\right) \in \left[-\frac{\pi\sqrt{H}}{4\beta},\ \frac{\pi\sqrt{H}}{4\beta}\right] \tag{9}$$

Notably, this Lipschitz bound on $\psi$, which implies a bound on $\psi'$, is the key non-quadratic property that allows for our lower bound. Since $\psi'$ is bounded, we are able to choose $p \approx 1/K$ without violating the variance constraint on the stochastic gradient oracle. Carefully balancing the hyperparameters completes the argument, the remaining details of which we defer to Appendix A.
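As a sanity check on the reconstruction of (5) and (9), the derivative of $\psi$ can be compared numerically against the closed form, along with the Lipschitz bound $\pi\sqrt{H}/(4\beta)$. This is a minimal sketch under assumed values $H = 4$ and $\beta = 0.5$ (chosen arbitrarily); the curvature bound $H/4$ checked at the end follows from differentiating (9) and is not stated in the text above.

```python
import math


def psi(x, H, beta):
    # Equation (5): (sqrt(H) x / (2 beta)) * arctan(u) - log(1 + u^2) / (2 beta^2),
    # with u = sqrt(H) * beta * x / 2.
    u = math.sqrt(H) * beta * x / 2
    return math.sqrt(H) * x / (2 * beta) * math.atan(u) - math.log(1 + u * u) / (2 * beta ** 2)


def psi_prime(x, H, beta):
    # Equation (9).
    return math.sqrt(H) / (2 * beta) * math.atan(math.sqrt(H) * beta * x / 2)


H, beta, h = 4.0, 0.5, 1e-5
grid = [k / 10 for k in range(-500, 501)]

# Central finite differences of psi versus the closed-form derivative.
max_fd_gap = max(
    abs((psi(x + h, H, beta) - psi(x - h, H, beta)) / (2 * h) - psi_prime(x, H, beta))
    for x in grid
)

# Lipschitz bound from (9) and the largest slope observed on the grid.
lipschitz_bound = math.pi * math.sqrt(H) / (4 * beta)
max_slope = max(abs(psi_prime(x, H, beta)) for x in grid)

# Finite-difference curvature: positive (convexity) and at most H / 4 (smoothness).
curvatures = [(psi_prime(x + h, H, beta) - psi_prime(x - h, H, beta)) / (2 * h) for x in grid]
min_curvature, max_curvature = min(curvatures), max(curvatures)
```

On this grid the finite-difference gap is at the level of floating-point error, the slope stays strictly below `lipschitz_bound`, and the curvature lies in $(0, H/4]$, in contrast to a quadratic whose slope would grow without bound.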

Theorem 1 also implies a lower bound for strongly convex objectives:

###### Corollary 1.

No distributed zero-respecting intermittent communication algorithm can guarantee that for any -smooth, -strongly convex objective and stochastic gradient oracle with variance less than that with probability the output will have suboptimality

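$$F(\hat{x}) - F^* \le c \cdot \left(\frac{F(0) - F^*}{K^2R^2}\exp\left(-\sqrt{\frac{\lambda}{H}}\,KR\right) + \frac{\sigma^2}{\lambda MKR} + \min\left\{\frac{F(0) - F^*}{R^2\log^2 M}\exp\left(-\sqrt{\frac{\lambda}{H}}\,R\log M\right),\ \frac{\sigma^2}{\lambda KR}\right\}\right)$$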

This lower bound is more limited than Theorem 1, since we prove it using a reduction from convex to strongly convex optimization, rather than directly. We also do not expect the exponential terms to be tight. Nevertheless, the Corollary gives some indication of the optimal rate in the strongly convex setting and, as with Theorem 1, it distinguishes between $K$ and $R$, unlike previous results. A simple proof can be found in Appendix A.

## 4 A Matching Upper Bound and an Optimal Algorithm

The lower bound in Theorem 1 is matched (up to logarithmic factors) by the combination of two simple distributed zero-respecting algorithms, which are distributed variants of an accelerated SGD algorithm called AC-SA due to Lan (2012). In the sequential setting, the AC-SA algorithm maintains two iterates $x_t$ and $y_t$, which it updates according to

$$\begin{aligned} y_{t+1} &= y_t - \gamma_t\, g_t\big(\beta_t^{-1}y_t + (1-\beta_t^{-1})x_t\big) \\ x_{t+1} &= \beta_t^{-1}y_{t+1} + (1-\beta_t^{-1})x_t \end{aligned} \tag{10}$$

where $\gamma_t$ and $\beta_t$ are carefully chosen stepsize parameters. In the smooth, convex setting, this algorithm converges at a rate (see Corollary 1, Lan, 2012)

$$\mathbb{E}[F(x_T) - F^*] \le c \cdot \left(\frac{HB^2}{T^2} + \frac{\sigma B}{\sqrt{T}}\right) \tag{11}$$
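The two-sequence update (10) is short enough to write out directly. The sketch below is not the authors' implementation: the stepsizes $\beta_t = (t+1)/2$ and $\gamma_t = (t+1)/(4H)$ are an assumed noise-free choice (a noisy oracle would call for a smaller $\gamma_t$ that depends on $\sigma$ and $T$), and the diagonal quadratic objective and the helper names are hypothetical.

```python
import random


def ac_sa(grad_oracle, x0, H, T):
    """Run T steps of update (10) with beta_t = (t+1)/2 and gamma_t = (t+1)/(4H) (assumed)."""
    x, y = list(x0), list(x0)
    for t in range(1, T + 1):
        beta_t = (t + 1) / 2
        gamma_t = (t + 1) / (4 * H)
        # Query point: beta_t^{-1} y_t + (1 - beta_t^{-1}) x_t.
        query = [yi / beta_t + (1 - 1 / beta_t) * xi for xi, yi in zip(x, y)]
        g = grad_oracle(query)
        y = [yi - gamma_t * gi for yi, gi in zip(y, g)]
        x = [yi / beta_t + (1 - 1 / beta_t) * xi for xi, yi in zip(x, y)]
    return x


def make_oracle(diag, sigma, rng):
    """Noisy gradient of F(x) = 0.5 * sum_i diag[i] * x_i^2 (toy objective)."""
    def oracle(x):
        return [d * xi + sigma * rng.gauss(0.0, 1.0) for d, xi in zip(diag, x)]
    return oracle


def objective(diag, x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(diag, x))


diag = [1.0, 0.1]  # smoothness constant H = 1
oracle = make_oracle(diag, sigma=0.0, rng=random.Random(0))
x_T = ac_sa(oracle, x0=[1.0, 1.0], H=1.0, T=200)
final_value = objective(diag, x_T)
```

With `sigma=0.0` the run is deterministic and the objective falls from 0.55 at the starting point to a small value after 200 steps, consistent with the $HB^2/T^2$ term in (11).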

To describe the optimal algorithm for the intermittent communication setting, we will first define two distributed variants of AC-SA.

The first algorithm, which we will refer to as Minibatch Accelerated SGD, implements $R$ iterations of AC-SA using minibatch gradients of size $MK$ (c.f. Cotter et al., 2011). Specifically, the method maintains two iterates $x_t$ and $y_t$ which are shared across all the machines. During each round of communication, each machine computes $K$ independent stochastic estimates of $\nabla F(\beta_t^{-1}y_t + (1-\beta_t^{-1})x_t)$; the machines then communicate their minibatches, averaging them together into a larger minibatch of size $MK$, and then they update $x_t$ and $y_t$ according to (10). Because the minibatching reduces the variance of the stochastic gradients by a factor of $MK$, (11) implies this method converges at a rate

$$\mathbb{E}[F(x_R) - F^*] \le c \cdot \left(\frac{HB^2}{R^2} + \frac{\sigma B}{\sqrt{MKR}}\right) \tag{12}$$
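The variance reduction behind (12) can be seen in a small simulation of one communication round. This is a sketch under simplifying assumptions: gradients are scalars, the noise is Gaussian, and the function name `minibatch_gradient` is hypothetical.

```python
import random
import statistics


def minibatch_gradient(true_grad, sigma, M, K, rng):
    """Each of M machines averages K noisy gradients; the machine averages are then averaged."""
    per_machine = [
        sum(true_grad + rng.gauss(0.0, sigma) for _ in range(K)) / K
        for _ in range(M)
    ]
    return sum(per_machine) / M


rng = random.Random(0)
M, K, sigma = 4, 25, 1.0
samples = [minibatch_gradient(2.0, sigma, M, K, rng) for _ in range(4000)]

empirical_mean = statistics.fmean(samples)
empirical_var = statistics.pvariance(samples)
expected_var = sigma ** 2 / (M * K)
```

The empirical variance of the averaged gradient is close to $\sigma^2/(MK)$, which is why the statistical term in (12) scales with $\sqrt{MKR}$ even though only $R$ updates are performed.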

The second algorithm, which we will call Single-Machine Accelerated SGD, “parallelizes” AC-SA in a different way. In contrast to Minibatch Accelerated SGD, Single-Machine Accelerated SGD simply ignores $M-1$ of the available machines and runs $KR$ steps of AC-SA on the remaining one, therefore converging like

$$\mathbb{E}[F(x_{KR}) - F^*] \le c \cdot \left(\frac{HB^2}{K^2R^2} + \frac{\sigma B}{\sqrt{KR}}\right) \tag{13}$$

From here, we point out that the lower bound in Theorem 1 is equal (up to logarithmic factors) to the minimum of (12) and (13). Furthermore, one can determine which of these algorithms achieves the minimum based on the problem parameters:

###### Theorem 2.

For any , the algorithm which returns the output of Minibatch Accelerated SGD when and returns the output of Single-Machine Accelerated SGD when is optimal up to a factor of .

This optimal algorithm is computationally efficient and requires no significant overhead. Each machine needs to store only a constant number of vectors, it performs only a constant number of vector additions for each stochastic gradient oracle access, and it communicates just one vector per round. Therefore, the total storage complexity is per machine, the sequential runtime complexity is , and the total communication complexity is . In fact, the communication complexity is when Single-Machine Accelerated SGD is used. Therefore, we do not expect a substantially better algorithm from the standpoint of computational efficiency either.

In light of Theorem 2 and the $\min\{\cdot,\cdot\}$ term in Theorem 1, we see that algorithms in this setting are offered the following dilemma: they may either attain the optimal statistical rate $\sigma B/\sqrt{MKR}$ but suffer an optimization rate $HB^2/R^2$ that does not benefit from $K$ at all, or they may attain the optimal optimization rate of $HB^2/(K^2R^2)$ but suffer a statistical rate $\sigma B/\sqrt{KR}$ as if only a single machine were available. In this sense, there is a very real dichotomy between exploiting parallelism and leveraging local computation.
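To get a feel for the dichotomy, the guarantees (12) and (13) and the expression in Theorem 1 can be evaluated for concrete parameters. The sketch below sets the unspecified constant $c$ to 1 in all three expressions, so the numbers are only indicative of scaling; the function name `rates` and the parameter values are assumptions, and $M \ge 3$ is assumed so that $\log M \ge 1$.

```python
import math


def rates(H, B, sigma, M, K, R):
    """Evaluate (12), (13), and the Theorem 1 expression with c = 1 (assumes M >= 3)."""
    minibatch = H * B**2 / R**2 + sigma * B / math.sqrt(M * K * R)
    single = H * B**2 / (K**2 * R**2) + sigma * B / math.sqrt(K * R)
    lower = (
        H * B**2 / (K**2 * R**2)
        + sigma * B / math.sqrt(M * K * R)
        + min(H * B**2 / (R**2 * math.log(M) ** 2), sigma * B / math.sqrt(K * R))
    )
    best = "minibatch" if minibatch <= single else "single-machine"
    return {"minibatch": minibatch, "single-machine": single, "lower": lower, "best": best}


# Few local steps per round: parallelism wins.
few_local = rates(H=1.0, B=1.0, sigma=1.0, M=16, K=100, R=10)

# Many local steps per round: local computation wins.
many_local = rates(H=1.0, B=1.0, sigma=1.0, M=16, K=100_000, R=10)
```

In the first configuration the minibatch guarantee is smaller, and in the second the single-machine guarantee is smaller. In both, the better of the two guarantees is within a factor of $\log^2 M$ of the Theorem 1 expression.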

The main shortcoming of the optimal algorithm is the need to know the problem parameters $H$, $B$, and $\sigma$ to implement it. However, knowledge of these parameters is anyway needed in order to choose the stepsizes for AC-SA, and we are not aware of accelerated variants of SGD that can be implemented without knowing them, even in the sequential setting. This algorithm is also somewhat unnatural because of the hard switch between Minibatch and Single-Machine Accelerated SGD. It would be nice, if only aesthetically, to have an algorithm that more naturally transitions from the Minibatch to the Single-Machine rate. Accelerated Local SGD (Yuan and Ma, 2020) or something similar is a contender for such an algorithm, although it is unclear whether or not this method can match the optimal rate in all regimes. Local SGD methods can also be augmented by using two stepsizes: a smaller, conservative stepsize for the local updates between communications, and a larger, aggressive stepsize when the local updates are aggregated. This two-stepsize approach allows for interpolation between Minibatch-like and Single-Machine-like behavior, and could be used to design a more “natural” optimal algorithm (see Section 6, Woodworth et al., 2020a).

## 5 Better than Optimal: Breaking the Lower Bound

Perhaps the most important use of a lower bound is in understanding how to break it. Instead of viewing the lower bound as telling us to give up any hope of improving over the naïve optimal method in Section 4, we should view it as informing us about possible means of making progress.

One way to break our lower bound is by introducing additional assumptions that are not satisfied by the hard instance. These assumptions could then be used to establish when and how some alternate method improves over the “optimal” method in Section 4. Several methods, which operate within the intermittent communication framework of Section 2, have been shown to be better than the “optimal algorithm” in practice for specific instances. However, attempts to demonstrate the benefit of these methods theoretically have so far failed, and we now understand why. In order to understand such benefits, we must introduce additional assumptions, and ask not “is this alternate method better” but rather “under what assumption is this alternate method better?” Below we suggest possible additional assumptions, including ones that have appeared in recent analysis and also other plausible assumptions one could rely on.

Another way to break the lower bound is by considering algorithms that go beyond the stochastic oracle framework of Section 2, utilizing more powerful oracles that nevertheless could be equally easy to implement. Understanding the lower bound can inform us of what type of such extensions might be useful, thus guiding development of novel types of optimization algorithms.

### 5.1 Relying on a Bounded Third Derivative

As we have mentioned, Theorem 1 does not hold in the special case of quadratic objectives of the form $Q(x) = \frac{1}{2}x^\top A x + b^\top x$ for p.s.d. $A$, e.g. least squares problems, in which case the min-max rate is much better, and Accelerated Local SGD achieves:

$$\mathbb{E}\,Q(\hat{x}) - Q^* \le c \cdot \left(\frac{HB^2}{K^2R^2} + \frac{\sigma B}{\sqrt{MKR}}\right) \tag{14}$$

Since improvement over the lower bound is possible when the objective is exactly quadratic, it stands to reason that similar improvement should be possible when the objective is sufficiently close to quadratic. Indeed, Yuan and Ma (2020) analyze another accelerated variant of Local SGD in the smooth, convex setting with the additional assumption that the Hessian is $\alpha$-Lipschitz. Their algorithm converges at a rate

$$\mathbb{E}\,F(\hat{x}) - F^* \le \tilde{O}\left(\frac{HB^2}{KR^2} + \frac{\sigma B}{\sqrt{MKR}} + \left(\frac{H\sigma^2B^4}{MKR^3}\right)^{1/3} + \left(\frac{\alpha\sigma^2B^5}{R^4K}\right)^{1/3}\right) \tag{15}$$

This can improve over the lower bound in Theorem 1 in certain parameter regimes. For instance, (15) is better if

$$\frac{H^2B^2}{\sigma^2} \le \frac{R^3}{MK} \quad \text{and} \quad \alpha \le \tilde{O}\left(\min\left\{\frac{\sigma R^{5/2}}{B^2K^{1/2}},\ \frac{H^3BK}{\sigma^2R^2\log^6 M}\right\}\right) \tag{16}$$

However, Yuan and Ma’s guarantee does not always improve over the lower bound, and it is not completely clear to what extent further improvement over their algorithm might be possible. In an effort to understand when it may or may not be possible to improve, we extend our lower bound to the case where $\nabla^2 F$ is $\alpha$-Lipschitz:

###### Theorem 3.

For any and any such that , there exists a convex, -smooth objective with and with being -Lipschitz with respect to the L2 norm, and a stochastic gradient oracle with for all , such that with probability at least all of the oracle queries made by any distributed-zero-respecting intermittent communication algorithm will have suboptimality

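$$\min_{m,k,r} F(x^m_{k,r}) - F^* \ge c \cdot \left[\frac{HB^2}{K^2R^2} + \frac{\sigma B}{\sqrt{MKR}} + \min\left\{\frac{HB^2}{R^2\log^2 M},\ \frac{\sqrt{\alpha\sigma}\,B^2}{K^{1/4}R^2\log^{7/4} M},\ \frac{\sigma B}{\sqrt{KR}}\right\}\right]$$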

We prove this lower bound in Appendix A using the same construction (4) as we used for Theorem 1, but using the parameter to control the third derivative of . This lower bound does not match the guarantee of Yuan and Ma’s algorithm, so it does not resolve the min-max complexity. However, there is reason to suspect that the lower bound is closer to the min-max rate, at least in certain regimes. For instance, when is taken to zero, i.e. the objective becomes quadratic, we know that Theorem 3 is tight while (15) can be larger by a factor of . For that reason, we suspect that (15) is suboptimal, but further analysis will be needed. At any rate, our lower bound does establish that there is a limit to the utility of assuming a Lipschitz Hessian. Specifically, there can be no advantage over the optimal algorithm from Section 4 once .

Theorem 3 and Yuan and Ma’s algorithm also highlight a substantial qualitative difference between distributed and sequential optimization: in the sequential setting, there is never any advantage to assuming that the objective is close to quadratic. In fact, worst-case instances for sequential optimization are exactly quadratic (Nemirovsky and Yudin, 1983; Nesterov, 2004; Simchowitz, 2018).

Beyond requiring that the Hessian be Lipschitz, there are other ways of measuring an objective’s closeness to a quadratic. Two notable examples are self-concordance (Nesterov, 1998) and quasi-self-concordance (Bach et al., 2010), which bound the third derivative of in terms of the second derivative: we say that is -self-concordant when for all , satisfies and we say it is -quasi-self-concordant if . There has been recent interest in such objectives (Bach et al., 2010; Zhang and Xiao, 2015; Karimireddy et al., 2018; Carmon et al., 2020)

which arise e.g. in logistic regression problems. In Appendix A, we extend the lower bound in Theorem 3 to these settings.

### 5.2 Statistical Learning Setting: Assumptions on Components

Stochastic optimization commonly arises in the context of statistical learning, where the goal is to minimize the expected loss with respect to a model’s parameters. In this case, the objective can be written $F(x) = \mathbb{E}_{z \sim \mathcal{D}}\, f(x; z)$, where $z$ represents data drawn i.i.d. from an unknown distribution $\mathcal{D}$, and the “components” $f(x; z)$ represent the loss of the model parametrized by $x$ on the example $z$.

In the setting of Theorem 1, we only place restrictions on the objective $F$ itself, and on the first and second moments of $g$. However, in the statistical learning setting, it is often natural to assume that the loss function $f(\cdot; z)$ itself satisfies particular properties for each $z$ individually. For instance, in our setting we might assume $f(\cdot; z)$ is convex and smooth and furthermore that the gradient oracle is given by $g(x) = \nabla f(x; z)$ for an i.i.d. $z \sim \mathcal{D}$. This is a non-trivial restriction on the stochastic gradient oracle, and it is conceivable that this property could be leveraged to design and analyze a method that converges faster than the lower bound in Theorem 1 would allow.

In particular, the specific stochastic gradient oracle (7) used to prove Theorem 1 cannot be written as the gradient of a random smooth function. In this sense, the lower bound construction is somewhat “unnatural,” however, we are not aware of any analysis that meaningfully444Numerous papers assume that and for some smooth, convex (e.g. Bottou et al., 2018; Nguyen et al., 2019; Koloskova et al., 2020; Woodworth et al., 2020a). Nevertheless, the purpose of this assumption is to bound or in terms of . In other words, one could prove the same guarantees in the setting of Theorem 1 with the additional constraint of the form for some parameter . Since the variance of the gradient oracle in our lower bound construction is bounded everywhere by a constant , it therefore applies to these analyses. exploits the fact that . An interesting question is whether such an assumption can be used to prove a better convergence guarantee, or whether Theorem 1 can be proven using a stochastic gradient oracle that obeys this constraint.

In the statistical learning setting, it is also natural to consider algorithms that can evaluate the gradient at multiple points for the same datum $z$. Specifically, allowing the algorithm access to a pool of samples drawn i.i.d. from $\mathcal{D}$ and to compute $\nabla f(x; z)$ for any chosen $x$ and $z$ opens up additional possibilities. Indeed, Arjevani et al. (2019) showed that multiple (even just two) accesses to each component enable substantially faster convergence ($\epsilon^{-3}$ vs. $\epsilon^{-4}$) in sequential stochastic non-convex optimization. Similar results have been shown for zeroth-order and bandit convex optimization (Agarwal et al., 2010; Duchi et al., 2015; Shamir, 2017; Nesterov and Spokoiny, 2017), where accessing each component twice allows for a quadratic improvement in the dimension-dependence.

In sequential smooth convex optimization, if $F$ has “finite-sum” structure (i.e. $\mathcal{D}$ is the uniform distribution on $\{z_1, \dots, z_n\}$), then allowing the algorithm to pick a component and access it multiple times opens the door to variance-reduction techniques like SVRG (Johnson and Zhang, 2013). These methods have updates of the form:

$$x_{t+1} = x_t - \eta_t\left(\nabla f(x_t; z_t) - \nabla f(\tilde{x}; z_t) + \nabla F(\tilde{x})\right) \tag{17}$$

Computing this update therefore requires evaluating the gradient of $f(\cdot; z_t)$ at two different points, which necessitates multiple accesses to a chosen component. This stronger oracle access allows faster rates compared with a single-access oracle (see discussion in, e.g., Arjevani et al., 2020).
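The role of the repeated access in update (17) is easiest to see in code. The following is a minimal sequential sketch on a toy scalar least-squares finite sum; it is not the distributed variant discussed in this section, and the data, stepsize, epoch count, and function name are assumptions chosen only for illustration.

```python
import random


def svrg(a, b, x0, eta, epochs, inner_steps, rng):
    """SVRG on F(x) = (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2 for scalar x."""
    n = len(a)

    def grad_i(x, i):
        # Gradient of the i-th component at x.
        return a[i] * (a[i] * x - b[i])

    x = x0
    for _ in range(epochs):
        snapshot = x
        # Full gradient at the snapshot point, recomputed once per epoch.
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Update (17): the same component i is evaluated at x and at the snapshot.
            x -= eta * (grad_i(x, i) - grad_i(snapshot, i) + full_grad)
    return x


a = [1.0, 2.0, 3.0]
b = [1.0, -1.0, 2.0]
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
x_hat = svrg(a, b, x0=5.0, eta=0.01, epochs=20, inner_steps=200, rng=random.Random(0))
```

Because the correction term `grad_i(x, i) - grad_i(snapshot, i)` shrinks as `x` approaches the snapshot, the update noise vanishes near the minimizer `x_star`, which an oracle that returns a single gradient per sampled component cannot reproduce.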

Most relevantly, in the intermittent communication setting, distributed variants of SVRG are able to improve over the lower bound in Theorem 1 (Wang et al., 2017; Lee et al., 2017; Shamir, 2016; Woodworth et al., 2018). For example, in the intermittent communication setting when $f(\cdot; z)$ is $H$-smooth and $L$-Lipschitz, and where the algorithm can access each component multiple times, Woodworth et al. (2018) show that using distributed SVRG to optimize an empirical objective composed of suitably many samples achieves convergence at the rate

$$\mathbb{E}F(\hat{x}) - F^* \le c \cdot \left(\left(\frac{HB^2}{RK} + \frac{LB}{\sqrt{MKR}}\right)\log\frac{MKR}{LB}\right) \tag{18}$$

While this guarantee (necessarily!) holds in a different setting than Theorem 1, the Lipschitz bound $L$ is generally analogous to the standard deviation of the stochastic gradients, $\sigma$ (indeed, $L$ is an upper bound on $\sigma$). With this in mind, this distributed SVRG algorithm can beat the lower bound in Theorem 1 in regimes where the relevant problem parameters are sufficiently large.

### 5.4 Higher Order and Other Stronger Oracles

Yet another avenue for improved algorithms in the intermittent communication setting is to use stronger stochastic oracles. For instance, one could use a stochastic second-order oracle that estimates $\nabla^2 F(x)$ (Hendrikx et al., 2020), or a stochastic Hessian-vector product oracle that estimates $\nabla^2 F(x)v$ given a vector $v$, which can typically be computed as efficiently as stochastic gradients. In the statistical learning setting, some recent work also considers a stochastic prox oracle, which returns the minimizer of a single component plus a quadratic penalty around a query point (Wang et al., 2017; Chadha et al., 2021).

As an example, a stochastic Hessian-vector product oracle, in conjunction with a stochastic gradient oracle can be used to efficiently implement a distributed Newton algorithm. Specifically, the Newton update can be rewritten as

$$x_{t+1} = x_t + \eta_t \operatorname*{arg\,min}_y \left\{\frac{1}{2} y^\top \nabla^2 F(x_t)\, y + \nabla F(x_t)^\top y\right\} \tag{19}$$

That is, each update can be viewed as the solution to a quadratic optimization problem, and its stochastic gradients can be computed using stochastic Hessian-vector and gradient access to $F$. The DiSCO algorithm (Zhang and Xiao, 2015) uses distributed preconditioned conjugate gradient descent to find an approximate Newton step. Alternatively, as previously discussed, this quadratic can be minimized to high accuracy in a single round of communication using Accelerated Local SGD. Under suitable assumptions (e.g., that $F$ is convex, smooth and self-concordant), this algorithm may converge substantially faster than the lower bounds in Theorems 1 and 3 would allow for first-order methods.
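To make the reduction concrete, the following Python sketch minimizes the quadratic in (19) using only gradients and Hessian-vector products. It is a deterministic, single-machine illustration under stated assumptions: the objective $F(x) = \sum_i \log\cosh(x_i) + \frac{1}{2}\|x\|^2$, the names `grad_F`, `hvp` and `newton_direction`, the finite-difference Hessian-vector product, and the use of plain gradient descent as the inner solver are all hypothetical choices, not the DiSCO or Accelerated Local SGD procedures.

```python
import numpy as np

def grad_F(x):
    # Illustrative objective (an assumption): F(x) = sum_i log cosh(x_i) + 0.5 * ||x||^2.
    return np.tanh(x) + x

def hvp(x, v, eps=1e-5):
    # Hessian-vector product approximated by central finite differences of gradients.
    return (grad_F(x + eps * v) - grad_F(x - eps * v)) / (2 * eps)

def newton_direction(x, steps=200, lr=0.4):
    """Approximately minimize q(y) = 0.5 * y^T H y + g^T y, with H, g taken at x."""
    g = grad_F(x)
    y = np.zeros_like(x)
    for _ in range(steps):
        # The gradient of q at y is H y + g, so only a Hessian-vector product is needed.
        y = y - lr * (hvp(x, y) + g)
    return y
```

With unit step size, repeating `x = x + newton_direction(x)` gives an approximate Newton method for this toy objective. In the stochastic setting, `grad_F` and `hvp` would be replaced by noisy oracle calls.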

#### Differences from Sequential Setting:

Interestingly, in the sequential setting there is no benefit to using stochastic Hessian-vector products over and above what can be achieved using just a stochastic gradient oracle. This is because the worst-case instances are simply quadratic, in which case Hessian-vector products and gradients are essentially equivalent. This adds to a list of structures that facilitate distributed optimization while being essentially useless in the sequential setting. Likewise, objectives being quadratic or near-quadratic facilitates distributed optimization but does not help sequential algorithms since, again, the hard instances for sequential optimization are already quadratic. Furthermore, accessing a statistical learning gradient oracle multiple times can allow for faster distributed algorithms—e.g. distributed SVRG or using the stochastic gradients to implement stochastic Hessian-vector products via finite-differencing—but it does not generally help in the sequential case without further assumptions (like the problem having finite-sum structure).

### 5.5 Beyond Single-Sample Oracles

Another class of distributed optimization algorithms, which includes ADMM (Boyd et al., 2011) and DANE (Shamir et al., 2014), involve solving an optimization problem on each machine at each round of the form

$$\min_x \frac{1}{K}\sum_{k=1}^{K} f(x; z^m_{k,r}) + \lambda_{r,m}\left\|x - y_{r,m}\right\|^2, \tag{20}$$

where $z^m_{k,r}$ are components of the objective, and the vectors $y_{r,m}$ and scalars $\lambda_{r,m}$ are chosen by the algorithm. Although these methods also involve processing $K$ samples, or components, at each round on each machine, and then communicating between the machines, they are quite distinct from the stochastic optimization algorithms we consider, and fall well outside the “stochastic optimization with intermittent communication” model we study. The main distinction is that in this paper we are focused on stochastic optimization methods, where each oracle access or “atomic operation” involves a single “data point” (a single component of a stochastic objective), or in our first-order model, a single stochastic gradient estimate, and can generally be performed in time $O(d)$, where $d$ is the dimensionality of $x$. In particular, each round consists of $K$ separate accesses and, in all the methods we consider, can be implemented in time $O(Kd)$. In contrast, (20) is a complex optimization problem involving many data points, and cannot be solved with $K$ atomic operations (see the note below). This distinction results in the first term of the lower bound in Theorem 1, namely the “optimization term,” not applying to methods using (20). In particular, even ignoring all but one of the machines and running the Mini-Batch Prox method (Wang et al., 2017) on a single machine ensures a suboptimality of

Footnote 5: The problem (20) could perhaps be approximately solved using a small number of passes over the $K$ data points, which would put us back within the scope of what we study in this paper, but that is not how these methods are generally analyzed.

$$\mathbb{E}F(\hat{x}) - F^* \le O\left(\frac{\sigma B}{\sqrt{KR}}\right), \tag{21}$$

entirely avoiding the first term of Theorem 1, and beating the lower bound when $\sigma$ is small.

Another difference is that DANE, as well as other methods which target Empirical Risk Minimization such as DiSCO (Zhang and Xiao, 2015) and AIDE (Reddi et al., 2016), work on the same batch of $K$ examples per machine in all rounds, i.e. they use (20) with only $KM$ (rather than $KMR$) random samples. In our setup and terminology, they thus require repeated access to components, as discussed above in Section 5.3. Furthermore, since they only use $KM$ samples overall, they cannot guarantee suboptimality better than $\sigma B/\sqrt{MK}$, a factor of $\sqrt{R}$ worse than the second term in Theorem 1.

The Mini-Batch Prox guarantee (21) is disappointing, and suboptimal, once and are large, and DANE is not optimal, at least when is large. Understanding the min-max complexity of the class of methods which solve (20) at each round on each machine thus remains an important and interesting open problem. We note that lower bounds and the optimality of some of these methods were studied in Arjevani and Shamir (2015), but in a somewhat different, non-statistical distributed setting.

#### Acknowledgements

BW is grateful for the support of a Google PhD Research Fellowship. This work is also partially supported by NSF-CCF/BSF award 1718970/2016741, and NS is also supported by a Google Faculty Research Award.

## Appendix A Proof of Theorems 1 and 3

We construct a hard instance for the lower bound using the scalar function $\psi: \mathbb{R} \to \mathbb{R}$:

$$\psi(x) = \frac{\sqrt{H}x}{2\beta}\arctan\left(\frac{\sqrt{H}\beta x}{2}\right) - \frac{1}{2\beta^2}\log\left(1 + \frac{H\beta^2x^2}{4}\right) \tag{22}$$

where $H$ is the parameter of smoothness, and $\beta$ is another parameter that controls the third derivative of $\psi$, which we will set later. The hard instance is then

$$F(x) = -\psi'(\zeta)x_1 + \psi(x_N) + \sum_{i=1}^{N-1}\psi(x_{i+1} - x_i) \tag{23}$$

where $\zeta$ and $N$ are additional parameters that will be chosen later. Lemma 2 below summarizes the relevant properties of $F$, whose proof relies on the following bounds on $\psi'''$:
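For readers who wish to experiment with the construction, (22) and (23) can be transcribed into Python as below. This is only a sketch: the function names and the 0-based array indexing are our own conventions, and the closed forms for $\psi'$ and $\psi''$ are obtained by differentiating (22) directly, which gives $\psi'(x) = \frac{\sqrt{H}}{2\beta}\arctan\left(\frac{\sqrt{H}\beta x}{2}\right)$ and $\psi''(x) = \frac{H}{4 + H\beta^2x^2}$.

```python
import numpy as np

def psi(x, H, beta):
    # Scalar function (22); a^2 equals H * beta^2 * x^2 / 4.
    a = np.sqrt(H) * beta * x / 2
    return np.sqrt(H) * x / (2 * beta) * np.arctan(a) - np.log1p(a**2) / (2 * beta**2)

def psi_prime(x, H, beta):
    # First derivative of (22).
    return np.sqrt(H) / (2 * beta) * np.arctan(np.sqrt(H) * beta * x / 2)

def psi_second(x, H, beta):
    # Second derivative of (22); it lies in (0, H/4].
    return H / (4 + H * beta**2 * x**2)

def F(x, H, beta, zeta):
    # Hard instance (23); x[0] plays the role of x_1 and x[-1] of x_N.
    x = np.asarray(x, dtype=float)
    chain = np.sum(psi(np.diff(x), H, beta))   # sum of psi(x_{i+1} - x_i)
    return -psi_prime(zeta, H, beta) * x[0] + psi(x[-1], H, beta) + chain
```

The dimension $N$ is implied by the length of `x`, so only $H$, $\beta$ and $\zeta$ appear as explicit arguments.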

###### Lemma 1.

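For any $x \in \mathbb{R}$,

$$\begin{aligned} |\psi'''(x)| &\le \frac{H^{3/2}\beta}{12} \\ |\psi'''(x)| &\le 2\beta\,\psi''(x)^{3/2} \\ |\psi'''(x)| &\le \frac{\sqrt{H}\beta}{2}\,\psi''(x) \end{aligned}$$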
###### Proof.

The third derivative of $\psi$ is

$$\psi'''(x) = -\frac{2H^2\beta^2x}{(4 + H\beta^2x^2)^2} \tag{24}$$

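For the first claim, we first maximize the simpler function $x \mapsto \frac{x}{(1+x^2)^2}$. We note that

\begin{align} \frac{d}{dx}\frac{x}{(1+x^2)^2} &= \frac{1 - 3x^2}{(1+x^2)^3} \tag{25} \\ \frac{d^2}{dx^2}\frac{x}{(1+x^2)^2} &= \frac{12x(x^2 - 1)}{(1+x^2)^4} \tag{26} \end{align}

Therefore, the derivative is zero at $x = \pm\frac{1}{\sqrt{3}}$, and of these two critical points the second derivative is negative only for $x = \frac{1}{\sqrt{3}}$; furthermore, $\lim_{x\to\pm\infty}\frac{x}{(1+x^2)^2} = 0$. Therefore, we conclude that

$$\max_{x\in\mathbb{R}}\frac{x}{(1+x^2)^2} = \max_{x\in\mathbb{R}}\frac{|x|}{(1+x^2)^2} = \frac{\sqrt{\frac{1}{3}}}{\left(1 + \sqrt{\frac{1}{3}}^{\,2}\right)^2} = \frac{3\sqrt{3}}{16} \tag{27}$$

By rescaling, we conclude that

$$\max_{x\in\mathbb{R}}\left|\psi'''(x)\right| = \max_{x\in\mathbb{R}}\frac{2H^2\beta^2|x|}{(4 + H\beta^2x^2)^2} = \frac{H^{3/2}\beta}{4}\max_{x\in\mathbb{R}}\frac{\left|\frac{\sqrt{H}\beta x}{2}\right|}{\left(1 + \left(\frac{\sqrt{H}\beta x}{2}\right)^2\right)^2} = \frac{3\sqrt{3}H^{3/2}\beta}{64} \tag{28}$$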

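Since $\frac{3\sqrt{3}}{64} < \frac{1}{12}$, this establishes the first claim. For the second claim, we observe that

$$\left|\psi'''(x)\right| = \frac{2\sqrt{H}\beta^2|x|}{\sqrt{4 + H\beta^2x^2}}\,\psi''(x)^{3/2} \le \frac{2\sqrt{H}\beta^2|x|}{\sqrt{H\beta^2x^2}}\,\psi''(x)^{3/2} = 2\beta\,\psi''(x)^{3/2} \tag{29}$$

Finally, for the third claim, we start by noting

$$\left|\psi'''(x)\right| = \frac{2H\beta^2|x|}{4 + H\beta^2x^2}\,\psi''(x) \tag{30}$$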

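We now consider the function $x \mapsto \frac{x}{1+x^2}$, for which

\begin{align} \frac{d}{dx}\frac{x}{1+x^2} &= \frac{1 - x^2}{(1+x^2)^2} \tag{31} \\ \frac{d^2}{dx^2}\frac{x}{1+x^2} &= \frac{2x(x^2 - 3)}{(1+x^2)^3} \tag{32} \end{align}

We conclude that

$$\max_{x\in\mathbb{R}}\frac{|x|}{1+x^2} = \frac{1}{1+1^2} = \frac{1}{2} \tag{33}$$

and therefore, rescaling (30) in the same way as before,

$$\left|\psi'''(x)\right| = \sqrt{H}\beta\cdot\frac{\left|\frac{\sqrt{H}\beta x}{2}\right|}{1 + \left(\frac{\sqrt{H}\beta x}{2}\right)^2}\,\psi''(x) \le \frac{\sqrt{H}\beta}{2}\,\psi''(x) \tag{34}$$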

This completes the proof. ∎
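As a sanity check rather than a proof, the three bounds of Lemma 1 can be verified numerically on a grid. The Python sketch below assumes the closed forms $\psi''(x) = \frac{H}{4 + H\beta^2x^2}$ and (24); the function name `check_lemma1_bounds`, the grid, and the parameter values used to call it are arbitrary choices for illustration.

```python
import numpy as np

def check_lemma1_bounds(H, beta, radius=50.0, num=200001):
    """Evaluate psi'' and psi''' on a grid and compare against the Lemma 1 bounds."""
    x = np.linspace(-radius, radius, num)
    psi2 = H / (4 + H * beta**2 * x**2)
    psi3 = -2 * H**2 * beta**2 * x / (4 + H * beta**2 * x**2) ** 2
    abs3 = np.abs(psi3)
    tol = 1e-12  # slack for floating-point rounding where a bound is tight
    return {
        "max_abs_psi3": abs3.max(),
        "predicted_max": 3 * np.sqrt(3) * H**1.5 * beta / 64,
        "claim1": bool(np.all(abs3 <= H**1.5 * beta / 12)),
        "claim2": bool(np.all(abs3 <= 2 * beta * psi2**1.5 + tol)),
        "claim3": bool(np.all(abs3 <= np.sqrt(H) * beta / 2 * psi2 + tol)),
    }
```

The grid maximum should agree with $\frac{3\sqrt{3}H^{3/2}\beta}{64}$ up to discretization error, and this value sits just below the bound $\frac{H^{3/2}\beta}{12}$ of the first claim.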

###### Lemma 2.

For any , , , and , is convex, -smooth, -self-concordant, -quasi-self-concordant, and .

###### Proof.

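First, we note that $0 < \psi''(x) = \frac{H}{4 + H\beta^2x^2} \le \frac{H}{4}$. Therefore, $F$ is the sum of convex functions and is thus convex itself. We now compute the Hessian of $F$:

$$\nabla^2F(x) = \psi''(x_N)e_Ne_N^\top + \sum_{i=1}^{N-1}\psi''(x_{i+1} - x_i)(e_{i+1} - e_i)(e_{i+1} - e_i)^\top \tag{35}$$

Therefore, for any $u \in \mathbb{R}^N$,

\begin{align} u^\top\nabla^2F(x)u &\le \psi''(x_N)u_N^2 + \sum_{i=1}^{N-1}\psi''(x_{i+1} - x_i)(u_{i+1} - u_i)^2 \tag{36} \\ &\le \frac{H}{4}\left[u_N^2 + \sum_{i=1}^{N-1}\left(2u_{i+1}^2 + 2u_i^2\right)\right] \tag{37} \\ &\le H\|u\|^2 \tag{38} \end{align}

We conclude that $\nabla^2F(x) \preceq H\cdot I$ and thus $F$ is $H$-smooth.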
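The Hessian formula (35) and the resulting smoothness bound can also be checked numerically. The sketch below is an illustration under assumptions: it uses the closed form $\psi''(t) = \frac{H}{4 + H\beta^2t^2}$, and the function name `hessian_F` and the 0-based indexing are hypothetical conventions of ours.

```python
import numpy as np

def hessian_F(x, H, beta):
    """Assemble the Hessian (35) of the hard instance at the point x."""
    x = np.asarray(x, dtype=float)
    N = x.size

    def psi_second(t):
        return H / (4 + H * beta**2 * t**2)

    hess = np.zeros((N, N))
    hess[-1, -1] += psi_second(x[-1])          # psi''(x_N) e_N e_N^T
    for i in range(N - 1):
        d = np.zeros(N)
        d[i + 1], d[i] = 1.0, -1.0             # the vector e_{i+1} - e_i
        hess += psi_second(x[i + 1] - x[i]) * np.outer(d, d)
    return hess
```

For any input, the eigenvalues of the returned matrix (for example via `np.linalg.eigvalsh`) should lie in $[0, H]$, consistent with convexity and $H$-smoothness.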

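Next, we compute the tensor of third derivatives of $F$:

$$\nabla^3F(x) = \psi'''(x_N)e_N^{\otimes 3} + \sum_{i=1}^{N-1}\psi'''(x_{i+1} - x_i)(e_{i+1} - e_i)^{\otimes 3} \tag{39}$$

where

$$\psi'''(x) = -\frac{2H^2\beta^2x}{(4 + H\beta^2x^2)^2} \tag{40}$$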

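Therefore, for any $u \in \mathbb{R}^N$,

$$\left|\nabla^3F(x)[u,u,u]\right| \le \left|\psi'''(x_N)u_N^3\right| + \sum_{i=1}^{N-1}\left|\psi'''(x_{i+1} - x_i)(u_{i+1} - u_i)^3\right| \tag{41}$$

We can bound this in several different ways using Lemma 1:

\begin{align} |\psi'''(x)| &\le \frac{H^{3/2}\beta}{12} \tag{42} \\ |\psi'''(x)| &\le 2\beta\,\psi''(x)^{3/2} \tag{43} \\ |\psi'''(x)| &\le \frac{\sqrt{H}\beta}{2}\,\psi''(x) \tag{44} \end{align}

Therefore,

\begin{align} &\left|\nabla^3F(x)[u,u,u]\right| \tag{45} \\ &\quad\le \frac{H^{3/2}\beta}{12}\left[|u_N|^3 + 8\sum_{i=1}^{N-1}\left(|u_{i+1}|^3 + |u_i|^3\right)\right] \tag{46} \\ &\quad\le \frac{4H^{3/2}\beta}{3}\|u\|^3 \tag{47} \end{align}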

Above, we used that . We conclude that .

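Similarly,

\begin{align} \left|\nabla^3F(x)[u,u,u]\right| &\le |\psi'''(x_N)||u_N|^3 + \sum_{i=1}^{N-1}|\psi'''(x_{i+1} - x_i)||u_{i+1} - u_i|^3 \tag{48} \\ &\le 2\beta\left[\psi''(x_N)^{3/2}(u_N^2)^{3/2} + \sum_{i=1}^{N-1}\psi''(x_{i+1} - x_i)^{3/2}\left((u_{i+1} - u_i)^2\right)^{3/2}\right] \tag{49} \\ &\le 2\beta\left[\psi''(x_N)u_N^2 + \sum_{i=1}^{N-1}\psi''(x_{i+1} - x_i)(u_{i+1} - u_i)^2\right]^{3/2} \tag{50} \\ &= 2\beta\left\langle\nabla^2F(x)u, u\right\rangle^{3/2} \tag{51} \end{align}

For the final inequality, we used that $\sum_i a_i^{3/2} \le \left(\sum_i a_i\right)^{3/2}$ for nonnegative $a_i$. We conclude that $F$ is $2\beta$-self-concordant.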

Finally,

 |∇3F(x)[u,u,u]| ≤|ψ′′′(xN)||uN|3+N−1∑i=1|ψ′′′(xi+1−xi)||ui+1−ui|3 (52) ≤√Hβ2[ψ′′(xN)|uN|3+N−1∑i=1ψ′′(xi+1−xi)|ui+1−ui|3] (53) ≤√Hβ2[ψ′′(xN)|uN|2+N−1∑i=1ψ′′(xi+1−xi)|ui+1−ui|2] (54) ⋅max{|uN|,max