The main problem of interest in this paper is the finite-sum convex programming (CP) problem given in the form of
\[
\psi^* := \min_{x \in X}\Big\{\psi(x) := \tfrac{1}{m}\textstyle\sum_{i=1}^m f_i(x) + \mu\,\omega(x)\Big\}. \tag{1.1}
\]
Here, $X \subseteq \mathbb{R}^n$ is a closed convex set, $f_i : X \to \mathbb{R}$, $i = 1, \dots, m$, are smooth convex functions with Lipschitz continuous gradients over $X$, i.e., there exist constants $L_i \ge 0$ such that
\[
\|\nabla f_i(x_1) - \nabla f_i(x_2)\|_* \le L_i \|x_1 - x_2\|, \quad \forall x_1, x_2 \in X, \tag{1.2}
\]
and $\omega : X \to \mathbb{R}$ is a strongly convex function with modulus $1$ w.r.t. a norm $\|\cdot\|$, i.e.,
\[
\omega(x_1) - \omega(x_2) - \langle \omega'(x_2), x_1 - x_2 \rangle \ge \tfrac{1}{2}\|x_1 - x_2\|^2, \quad \forall x_1, x_2 \in X, \tag{1.3}
\]
where $\omega'$ denotes any subgradient (or gradient) of $\omega$ and $\mu \ge 0$ is a given constant. Hence, the objective function $\psi$ is strongly convex whenever $\mu > 0$. For notational convenience, we also denote $f(x) := \tfrac{1}{m}\sum_{i=1}^m f_i(x)$, $L := \tfrac{1}{m}\sum_{i=1}^m L_i$, and $\hat{L} := \max_{i=1,\dots,m} L_i$. It is easy to see that, for some $L_f \le L$,
\[
\|\nabla f(x_1) - \nabla f(x_2)\|_* \le L_f \|x_1 - x_2\|, \quad \forall x_1, x_2 \in X. \tag{1.4}
\]
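To fix ideas, the following sketch instantiates the setup above with least-squares components; all data, dimensions, and function names are purely illustrative, not part of the paper's development:

```python
import numpy as np

# Illustrative instance of problem (1.1): each component f_i is a least-squares
# loss f_i(x) = 1/2 ||A_i x - b_i||^2 and omega(x) = 1/2 ||x||^2 (strong
# convexity modulus 1), so psi(x) = (1/m) sum_i f_i(x) + mu * omega(x).
rng = np.random.default_rng(0)
m, n = 5, 3
A = [rng.standard_normal((4, n)) for _ in range(m)]
b = [rng.standard_normal(4) for _ in range(m)]
mu = 0.1

def psi(x):
    fsum = sum(0.5 * np.sum((A[i] @ x - b[i]) ** 2) for i in range(m)) / m
    return fsum + 0.5 * mu * np.sum(x ** 2)

def grad_psi(x):
    # gradient of the average of the components plus mu times grad omega
    return sum(A[i].T @ (A[i] @ x - b[i]) for i in range(m)) / m + mu * x

# The Lipschitz constant L_f of grad f is bounded by the average L of the
# component constants L_i = ||A_i^T A_i||_2, consistent with (1.2)-(1.4).
L_i = [np.linalg.norm(Ai.T @ Ai, 2) for Ai in A]
L_f = np.linalg.norm(sum(Ai.T @ Ai for Ai in A) / m, 2)
assert L_f <= sum(L_i) / m + 1e-12
```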
We also consider a class of stochastic finite-sum optimization problems given by
\[
\psi^* := \min_{x \in X}\Big\{\psi(x) := \tfrac{1}{m}\textstyle\sum_{i=1}^m \mathbb{E}_{\xi_i}\big[F_i(x, \xi_i)\big] + \mu\,\omega(x)\Big\}, \tag{1.5}
\]
where the $\xi_i$'s are random variables with given support. It can be easily seen that (1.5) is a special case of (1.1) with $f_i(x) = \mathbb{E}_{\xi_i}[F_i(x, \xi_i)]$, $i = 1, \dots, m$. However, different from deterministic finite-sum optimization problems, only noisy gradient information of each component function $f_i$ can be accessed for the stochastic finite-sum optimization problem in (1.5).
The deterministic finite-sum problem (1.1) can model empirical risk minimization in machine learning and statistical inference, and hence has become the subject of intensive study during the past few years. Our study of the finite-sum problems (1.1) and (1.5) has also been motivated by the emerging need for distributed optimization and machine learning. Under such settings, each component function $f_i$ is associated with an agent $i$, $i = 1, \dots, m$, and the agents are connected through a distributed network. While different topologies can be considered for distributed optimization (see, e.g., Figures 1 and 2), in this paper we focus on the star network, where $m$ agents are connected to one central server and all agents communicate only with the server (see Figure 1). These types of distributed optimization problems have several unique features. Firstly, they allow for data privacy, since no local data is stored on the server. Secondly, network agents behave independently and may not be responsive at the same time. Thirdly, the communication between the server and the agents can be expensive and have high latency. Finally, by considering the stochastic finite-sum optimization problem, we are interested not only in the deterministic empirical risk minimization, but also in the generalization risk for distributed machine learning. Moreover, we allow the private data for each agent to be collected in an online (streaming) fashion. One typical example of the aforementioned distributed problems is Federated Learning, recently introduced by Google. As a particular example, in the $\ell_2$-regularized logistic regression problem, we have
\[
f_i(x) := \tfrac{1}{n_i}\textstyle\sum_{j=1}^{n_i} \log\big(1 + \exp(-b_{ij}\, a_{ij}^{\top} x)\big), \quad i = 1, \dots, m,
\]
where $f_i$ is the loss function of agent $i$ with training data $\{(a_{ij}, b_{ij})\}_{j=1}^{n_i}$, and $\lambda > 0$ is the penalty parameter. For minimization of the generalization risk, the $f_i$'s are given in the form of expectation, i.e., $f_i(x) = \mathbb{E}_{\xi_i}[F_i(x, \xi_i)]$, where the random variable $\xi_i$ models the underlying distribution of the training dataset of agent $i$.
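For illustration, one agent's logistic loss and its gradient can be sketched as follows; the helper names and data here are hypothetical, not the paper's notation:

```python
import numpy as np

# Sketch of one agent's logistic-regression component as described above:
# the agent holds labeled pairs (a_ij, b_ij) with b_ij in {-1, +1}, and
# f_i(x) = (1/n_i) sum_j log(1 + exp(-b_ij * <a_ij, x>)).
def agent_logistic_loss(x, feats, labels):
    margins = labels * (feats @ x)              # b_ij * <a_ij, x>
    return np.mean(np.log1p(np.exp(-margins)))

def agent_logistic_grad(x, feats, labels):
    margins = labels * (feats @ x)
    # derivative of log(1 + exp(-t)) is -1 / (1 + exp(t)), chained with b_ij * a_ij
    weights = -labels / (1.0 + np.exp(margins))
    return feats.T @ weights / len(labels)
```

The regularization term of the problem is carried by $\mu\,\omega(x)$ and hence is not included in the per-agent loss.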
Note that another type of topology for distributed optimization is the multi-agent network without a central server, namely the decentralized setting, as shown in Figure 2, where the agents can only communicate with their neighbors to update information; please refer to [21, 32, 23] and the references therein for decentralized algorithms.
During the past few years, randomized incremental gradient (RIG) methods have emerged as an important class of first-order methods for finite-sum optimization (e.g., [4, 16, 35, 8, 29, 22, 1, 14, 24]). For solving nonsmooth finite-sum problems, Nemirovski et al. [26, 27] showed that stochastic subgradient (mirror) descent methods can possibly save up to $\mathcal{O}(m)$ subgradient evaluations. By utilizing the smoothness properties of the objective, Lan showed that one can separate the impact of variance from other deterministic components for stochastic gradient descent and presented a new class of accelerated stochastic gradient descent methods to further improve these complexity bounds. However, the overall rate of convergence of these stochastic methods is still sublinear, even for smooth and strongly convex finite-sum problems (see [11, 12]). Inspired by these works and the success of the incremental aggregated gradient method by Blatt et al., Schmidt et al. presented a stochastic average gradient (SAG) method, which uses randomized sampling of the components $f_i$ to update the gradients, and can achieve a linear rate of convergence, i.e., an $\mathcal{O}\{(m + L/\mu)\log(1/\epsilon)\}$ complexity bound, to solve unconstrained finite-sum problems (1.1). Johnson and Zhang later
presented a stochastic variance reduced gradient (SVRG) method, which computes an estimator of $\nabla f$ by iteratively updating the gradient of one randomly selected component of the currently stored exact gradient information and re-evaluating the exact gradient from time to time. Xiao and Zhang later extended SVRG to solve proximal finite-sum problems (1.1). All these methods exhibit an improved $\mathcal{O}\{(m + L/\mu)\log(1/\epsilon)\}$ complexity bound, and Defazio et al. also presented an improved SAG method, called SAGA, that can achieve such a complexity result. Compared to the class of stochastic dual methods (e.g., [31, 30, 36]), each iteration of the RIG methods only involves the computation of $\nabla f_i$, rather than solving a more complicated subproblem, which may not have explicit solutions.
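The variance-reduced estimator used by SVRG, as described above, can be sketched in a few lines; the quadratic components below are a hypothetical stand-in chosen only so the construction can be checked:

```python
import numpy as np

# SVRG gradient estimator: a full gradient at an anchor point x_tilde is
# computed occasionally; per iteration only one component gradient is
# re-evaluated, giving
#   v = grad f_i(x) - grad f_i(x_tilde) + full_grad(x_tilde),
# which is unbiased for grad f(x) when i is sampled uniformly.
rng = np.random.default_rng(2)
m, n = 6, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def grad_i(i, x):                      # gradient of f_i(x) = 1/2 (a_i x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / m

x_tilde = rng.standard_normal(n)       # anchor point with stored full gradient
g_tilde = full_grad(x_tilde)

def svrg_estimator(i, x):
    return grad_i(i, x) - grad_i(i, x_tilde) + g_tilde
```

The corrective term vanishes as the iterates and the anchor both approach the optimum, which is what drives the variance reduction.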
Noting that most of these RIG methods are not optimal even for $m = 1$, much recent research effort has been directed to the acceleration of RIG methods. In 2015, Lan and Zhou proposed a RIG method, namely the randomized primal-dual gradient (RPDG) method, and showed that its total number of gradient computations of $f_i$ can be bounded by
\[
\mathcal{O}\Big\{\big(m + \sqrt{m L/\mu}\,\big)\log(1/\epsilon)\Big\}. \tag{1.6}
\]
The RPDG method utilizes a direct acceleration without even using the concept of variance reduction, evolving from the randomized primal-dual methods developed in [36, 7] for solving saddle-point problems. Lan and Zhou also established a lower complexity bound for RIG methods by showing that the number of gradient evaluations of $f_i$ required by any RIG method to find an $\epsilon$-solution of (1.1), i.e., a point $\bar x \in X$ s.t. $\mathbb{E}[\|\bar x - x^*\|_2^2] \le \epsilon$, cannot be smaller than
\[
\Omega\Big\{\big(m + \sqrt{m L/\mu}\,\big)\log(1/\epsilon)\Big\} \tag{1.7}
\]
whenever the dimension is sufficiently large relative to the total number of iterations. Simultaneously, Lin et al. presented a catalyst scheme, which utilizes a restarting technique to accelerate the SAG method (or other “non-accelerated” first-order methods) and thus can possibly improve the complexity bounds obtained by SVRG and SAGA to (1.6) (under the Euclidean setting). Allen-Zhu later showed that one can also directly accelerate SVRG to achieve the optimal rate of convergence (1.6). All these accelerated RIG methods can save up to $\mathcal{O}(\sqrt m)$ in the number of gradient evaluations of $f_i$ compared to optimal deterministic first-order methods when $L/\mu \ge m$.
It should be noted that most existing RIG methods were inspired by empirical risk minimization on a single server (or cluster) in machine learning, rather than on a set of agents distributed over a network. Under the distributed setting, methods requiring full gradient computation and/or restarting from time to time may incur extra communication and synchronization costs. As a consequence, methods which require fewer full gradient computations (e.g., SAG, SAGA, and RPDG) seem to be more advantageous in this regard. An interesting but yet unresolved question in stochastic optimization is whether there exists a method which does not require the computation of any full gradients (even at the initial point), but can still achieve the optimal rate of convergence in (1.6). Moreover, little attention in the study of RIG methods has been paid to the stochastic finite-sum problem in (1.5), which is important for generalization risk minimization in machine learning. Very recently, there has been some progress on stochastic primal-dual type methods for solving problem (1.5). For example, Lan, Lee, and Zhou proposed a stochastic decentralized communication sliding method that can achieve the optimal sampling complexity and the best-known complexity bounds for communication rounds for solving stochastic decentralized strongly convex problems. For the distributed setting with a central server, by using a mini-batch technique to collect gradient information and any stochastic gradient based algorithm as a black box to update iterates, Dekel et al. presented a distributed mini-batch algorithm with a properly chosen batch size that can obtain the optimal sampling complexity (i.e., number of stochastic gradients) for stochastic strongly convex problems, and hence implies at least an $\mathcal{O}(1/\sqrt{\epsilon})$ bound on the communication complexity. An asynchronous version was later proposed by Feyzmahdavian et al., which maintains the above convergence rate for regularized stochastic strongly convex problems.
It should be pointed out that these mini-batch based distributed algorithms require sampling from all network agents iteratively, and hence lead to at least an $\mathcal{O}(1/\sqrt{\epsilon})$ rate of convergence in terms of communication costs between the server and the agents. It is unknown whether there exists an algorithm which requires a significantly smaller number of communication rounds (e.g., $\mathcal{O}(m \log(1/\epsilon))$), but can achieve the optimal sampling complexity for solving the stochastic finite-sum problem in (1.5).
The main contribution of this paper is to introduce a new randomized incremental gradient type method to solve (1.1) and (1.5). Firstly, we develop a random gradient extrapolation method (RGEM) for solving (1.1) that does not require any exact gradient evaluations of $f$. For strongly convex problems, we demonstrate that RGEM can still achieve the optimal rate of convergence (1.6) under the assumption that the average of the gradients of the $f_i$'s at the initial point is bounded. To the best of our knowledge, this is the first time that such an optimal RIG method without any exact gradient evaluations has been presented for solving (1.1) in the literature. In fact, without any full gradient computation, RGEM possesses iteration costs as low as pure stochastic gradient descent (SGD) methods, but achieves a much faster, optimal linear rate of convergence for solving deterministic finite-sum problems. In comparison with the well-known randomized Kaczmarz method, which can be viewed as an enhanced version of SGD and can also achieve a linear rate of convergence for solving linear systems, RGEM has a better convergence rate in terms of the dependence on the condition number $L/\mu$. Secondly, we develop a stochastic version of RGEM and establish its optimal convergence properties for solving stochastic finite-sum problems (1.5). More specifically, we assume that only noisy first-order information of one randomly selected component function $f_i$ can be accessed via a stochastic first-order (SFO) oracle iteratively. In other words, at each iteration only one randomly selected network agent needs to compute an estimator of its gradient by sampling from its local data using an SFO oracle, instead of performing an exact gradient evaluation of its component function $f_i$. Note that for these problems it is difficult to compute the exact gradients, even at the initial point.
Under standard assumptions for centralized stochastic optimization, i.e., that the gradient estimators computed by the SFO oracle are unbiased and have bounded variance, the number of stochastic gradient evaluations performed by RGEM to solve (1.5) can be bounded by (1.8) for finding a point $\bar x \in X$ s.t. $\mathbb{E}[\psi(\bar x) - \psi^*] \le \epsilon$ (here the notation $\tilde{\mathcal{O}}(\cdot)$ indicates that the rate of convergence is given up to a logarithmic factor). Moreover, by utilizing the mini-batch technique, RGEM can achieve an
\[
\mathcal{O}\Big\{\big(m + \sqrt{m L/\mu}\,\big)\log(1/\epsilon)\Big\} \tag{1.9}
\]
complexity bound in terms of the number of communication rounds, and each round only involves the communication between the server and a randomly selected agent. This bound seems to be optimal, since it matches the lower complexity bound (1.7) for RIG methods to solve deterministic finite-sum problems. It is worth noting that the former bound (1.8) is independent of the number of agents $m$, while the latter one (1.9) depends only linearly on $m$, or even just on $\sqrt m$ for ill-conditioned problems. To the best of our knowledge, this is the first time that such an RIG type method has been developed for solving stochastic finite-sum problems (1.5) that can achieve the optimal communication complexity and a nearly optimal (up to a logarithmic factor) sampling complexity in the literature.
RGEM is developed based on a novel algorithmic framework, namely the gradient extrapolation method (GEM), that we introduce in this paper for solving black-box convex optimization (i.e., $m = 1$). The development of GEM was inspired by our recent studies on the relation between accelerated gradient methods and primal-dual gradient methods. In particular, it has been observed that Nesterov's accelerated gradient method is a special primal-dual gradient (PDG) method in which the extrapolation step is performed in the primal space. Such a primal extrapolation step, however, might result in a search point outside the feasible region under the randomized setting of the RPDG method mentioned above. In view of this deficiency of the PDG and RPDG methods, we propose to switch the primal and dual spaces for primal-dual gradient methods, and to perform the extrapolation step in the dual (gradient) space. The resulting new first-order method, i.e., GEM, can be viewed as a dual version of Nesterov's accelerated gradient method, and we show that it can also achieve the optimal rate of convergence for black-box convex optimization.
RGEM is a randomized version of GEM which computes the gradient of only one randomly selected component function at each iteration. It utilizes the gradient extrapolation step not only for predicting dual information, as in GEM, but also for estimating exact gradients. As a result, it has several advantages over RPDG. Firstly, RPDG requires the restrictive assumption that each $f_i$ has to be differentiable and have Lipschitz continuous gradients over the whole of $\mathbb{R}^n$, due to its primal extrapolation step. RGEM relaxes this assumption to Lipschitz continuity of the gradients over the feasible set $X$ (see (1.2)), and hence can be applied to a much broader class of problems. Secondly, RGEM possesses a simpler convergence analysis, carried out in the primal space, thanks to its simplified algorithmic scheme. By contrast, RPDG has a more complicated algorithmic scheme, which contains a primal extrapolation step and a gradient (dual) prediction step in addition to solving a primal proximal subproblem, and thus leads to an intricate primal-dual convergence analysis. Last but not least, it is unknown whether RPDG could maintain the optimal convergence rate (1.6) without the exact gradient evaluations of the $f_i$'s during initialization.
This paper is organized as follows. In Section 2 we present the proposed random gradient extrapolation method (RGEM) and its convergence properties for solving (1.1) and (1.5). In order to provide more insights into the design of the algorithmic scheme of RGEM, we provide an introduction to the gradient extrapolation method (GEM) and its relation to the primal-dual gradient method, as well as Nesterov's method, in Section 3. Section 4 is devoted to the convergence analysis of RGEM. Some concluding remarks are made in Section 5.
1.1 Notation and terminology
We use $\|\cdot\|$ to denote a general norm in $\mathbb{R}^n$ without specific mention, and $\|\cdot\|_*$ to denote its conjugate norm. For any $p \ge 1$, $\|\cdot\|_p$ denotes the standard $p$-norm in $\mathbb{R}^n$, i.e., $\|x\|_p^p = \sum_{i=1}^n |x_i|^p$. For any convex function $h$, $\partial h(x)$ denotes its subdifferential at $x$. For a given strongly convex function $\omega$ with modulus $1$ (see (1.3)), we define a prox-function associated with $\omega$ as
\[
P(x_0, x) := \omega(x) - \big[\omega(x_0) + \langle \omega'(x_0), x - x_0 \rangle\big],
\]
where $\omega'(x_0) \in \partial \omega(x_0)$ is an arbitrary subgradient of $\omega$ at $x_0$. By the strong convexity of $\omega$, we have
\[
P(x_0, x) \ge \tfrac{1}{2}\|x - x_0\|^2, \quad \forall x, x_0 \in X.
\]
It should be pointed out that the prox-function $P(\cdot, \cdot)$ described above is a generalized Bregman distance, in the sense that $\omega$ is not necessarily differentiable. This differs from the standard definition of the Bregman distance [5, 2, 3, 17, 6]. Throughout this paper, we assume that the prox-mapping associated with $X$ and $\omega$, given by
\[
M_X(g, x_0, \eta) := \arg\min_{x \in X}\big\{\langle g, x \rangle + \mu\,\omega(x) + \eta\, P(x_0, x)\big\},
\]
is easily computable for any $x_0 \in X$, $g \in \mathbb{R}^n$, and $\eta > 0$. For any real number $r$, $\lceil r \rceil$ and $\lfloor r \rfloor$ denote the nearest integer to $r$ from above and below, respectively. $\mathbb{R}_+$ and $\mathbb{R}_{++}$, respectively, denote the sets of nonnegative and positive real numbers.
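In the Euclidean setup $\omega(x) = \tfrac{1}{2}\|x\|_2^2$ over a coordinate box, the prox-mapping admits a closed form, since the prox-function reduces to $\tfrac{1}{2}\|x - x_0\|_2^2$ and the objective is coordinate-separable. A minimal sketch (the function signature is an assumed illustration, not the paper's notation):

```python
import numpy as np

# Euclidean prox-mapping over a box: minimize <g, x> + mu/2 ||x||^2
# + eta/2 ||x - x0||^2 over X = [lo, hi]^n. Setting the gradient
# g + mu*x + eta*(x - x0) to zero gives the unconstrained minimizer,
# and coordinate-wise clipping onto the box preserves optimality
# because the objective separates across coordinates.
def prox_mapping_box(g, x0, eta, mu, lo, hi):
    x = (eta * x0 - g) / (mu + eta)
    return np.clip(x, lo, hi)
```

For other choices of $\omega$ (e.g., the entropy function over a simplex) the mapping takes different closed forms; the assumption above is only that some such efficient formula is available.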
2 Algorithms and main results
This section contains three subsections. We first present in Subsection 2.1 an optimal random gradient extrapolation method (RGEM) for solving the distributed finite-sum problem in (1.1), and then discuss in Subsection 2.2 a stochastic version of RGEM for solving the stochastic finite-sum problem in (1.5). Subsection 2.3 is devoted to the implementation of RGEM in a distributed setting and a discussion of its communication complexity.
2.1 RGEM for deterministic finite-sum optimization
The basic scheme of RGEM is formally stated in Algorithm 1. This algorithm simply initializes all the stored gradients to zero. At each iteration, RGEM requires the new gradient information of only one randomly selected component function $f_{i_k}$, but maintains $m$ pairs of search points and gradients, which are stored by their corresponding agents in the distributed network. More specifically, it first performs a gradient extrapolation step in (2.13) and the primal proximal mapping in (2.14). Then a randomly selected block is updated in (2.15), and the corresponding component gradient is computed in (2.16). As can be seen from Algorithm 1, RGEM does not require any exact gradient evaluations.
Note that the computation of the iterate in (2.14) requires an involved computation of the average of the $m$ stored gradients. In order to save computational time when implementing this algorithm, we suggest computing this quantity in a recursive manner as follows. Let $g^k$ denote the average of the stored gradients at iteration $k$. Clearly, in view of the fact that only one component, say $i_k$, is updated at iteration $k$, we have
\[
g^k = g^{k-1} + \tfrac{1}{m}\big(y_{i_k}^k - y_{i_k}^{k-1}\big). \tag{2.18}
\]
Also, by the definition of the extrapolated gradients and (2.13), a similar one-term correction applies to their average, since only the block updated in the previous iteration contributes to the extrapolation; this yields (2.19). Using these two ideas, we can compute the required average in two steps: i) initialize $g^0 = 0$, and update $g^k$ as in (2.18) after the gradient evaluation step (2.16); ii) replace (2.13) by (2.19) to compute the average of the extrapolated gradients. Also note that the difference $y_{i_k}^k - y_{i_k}^{k-1}$ can be saved, as it is used in both (2.18) and (2.19) in the next iteration. These enhancements will be incorporated into the distributed setting in Subsection 2.3 to possibly save communication costs.
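The recursion (2.18) is ordinary running-average bookkeeping, replacing an $\mathcal{O}(mn)$ re-summation with an $\mathcal{O}(n)$ correction. A small self-contained check, with synthetic vectors standing in for the stored gradients (all names illustrative):

```python
import numpy as np

# Running-average bookkeeping as in (2.18): instead of re-summing all m stored
# gradients to form their average at every iteration, the server keeps a
# running average and corrects it with the single updated block.
rng = np.random.default_rng(3)
m, n = 10, 4
y = rng.standard_normal((m, n))        # stored gradients, one row per agent
g = y.mean(axis=0)                     # running average, formed once

for k in range(100):
    i = rng.integers(m)                # randomly selected agent
    y_new = rng.standard_normal(n)     # its freshly computed gradient
    g = g + (y_new - y[i]) / m         # O(n) recursive update, cf. (2.18)
    y[i] = y_new

assert np.allclose(g, y.mean(axis=0))  # matches the direct O(mn) average
```

The same one-term correction idea underlies (2.19) for the average of the extrapolated gradients.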
It is also interesting to observe the differences between RGEM and RPDG. RGEM has only one extrapolation step (2.13), which combines two types of predictions. One is to predict future gradients using historical data, and the other is to obtain an estimator of the current exact gradient of $f$ from the randomly updated gradient information of the components. The RPDG method, however, needs two extrapolation steps, in both the primal and the dual spaces. Due to the existence of the primal extrapolation step, RPDG cannot guarantee that the search points where it performs gradient evaluations fall within the feasible set $X$. Hence, it requires the assumption that the $f_i$'s are differentiable with Lipschitz continuous gradients over $\mathbb{R}^n$. Such a strong assumption is not required by RGEM, since all the primal iterates generated by RGEM stay within the feasible region $X$. As a result, RGEM can deal with a much wider class of problems than RPDG. Moreover, RGEM requires no exact gradient computation for initialization, which provides a fully-distributed algorithmic framework under the assumption that there exists $\sigma_0 \ge 0$ such that
\[
\tfrac{1}{m}\textstyle\sum_{i=1}^m \|\nabla f_i(x^0)\|_*^2 \le \sigma_0^2, \tag{2.20}
\]
where $x^0 \in X$ is the given initial point.
We now provide a constant step-size policy for RGEM to solve strongly convex problems given in the form of (1.1) and show that the resulting algorithm exhibits an optimal linear rate of convergence in Theorem 2.1. The proof of Theorem 2.1 can be found in Subsection 4.1.
If (2.20) holds and is set as
In view of Theorem 2.1, we can provide bounds on the total number of gradient evaluations performed by RGEM to find a stochastic $\epsilon$-solution of problem (1.1), i.e., a point $\bar x \in X$ s.t. $\mathbb{E}[\psi(\bar x) - \psi^*] \le \epsilon$. Theorem 2.1 implies that the number of gradient evaluations of $f_i$ performed by RGEM to find a stochastic $\epsilon$-solution of (1.1) can be bounded by
\[
\mathcal{O}\Big\{\big(m + \sqrt{m \hat L/\mu}\,\big)\log(1/\epsilon)\Big\}.
\]
Here $\hat L = \max_{i} L_i$. Therefore, whenever the square-root term is dominating, and $L$ and $\hat L$ are in the same order of magnitude, RGEM can save up to $\mathcal{O}(\sqrt m)$ gradient evaluations of the component functions $f_i$ compared to optimal deterministic first-order methods. More specifically, RGEM does not require any exact gradient computation, and its communication cost is similar to that of pure stochastic gradient descent. To the best of our knowledge, this is the first time that such an optimal RIG method has been presented for solving (1.1) in the literature. It should be pointed out that, while the rates of convergence of RGEM obtained in Theorem 2.1 are stated in terms of expectation, we can develop large-deviation results for these rates of convergence using similar techniques for solving strongly convex problems.
Furthermore, if a one-time exact gradient evaluation is available at the initial point, i.e., $y_i^0 = \nabla f_i(x^0)$, $i = 1, \dots, m$, we can drop the assumption in (2.20) and employ a more aggressive stepsize policy with
Similarly, we can demonstrate that the number of gradient evaluations of $f_i$ performed by RGEM with this initialization method to find a stochastic $\epsilon$-solution can be bounded by
2.2 RGEM for stochastic finite-sum optimization
We discuss in this subsection the stochastic finite-sum optimization and online learning problems, where only noisy gradient information of each $f_i$ can be accessed via a stochastic first-order (SFO) oracle. In particular, for any given point $x \in X$, the SFO oracle outputs a vector $G_i(x, \xi_i)$ s.t.
\[
\mathbb{E}_{\xi_i}\big[G_i(x, \xi_i)\big] = \nabla f_i(x), \tag{2.27}
\]
\[
\mathbb{E}_{\xi_i}\big[\|G_i(x, \xi_i) - \nabla f_i(x)\|_*^2\big] \le \sigma^2. \tag{2.28}
\]
We also assume throughout this subsection that the norm $\|\cdot\|$ is associated with the inner product $\langle \cdot, \cdot \rangle$.
As shown in Algorithm 2, the RGEM for stochastic finite-sum optimization is naturally obtained by replacing the gradient evaluation of $f_i$ in Algorithm 1 (see (2.16)) with a stochastic gradient estimator of $\nabla f_i$ given in (2.29). In particular, at each iteration, we collect a batch of stochastic gradients of only one randomly selected component $f_{i_k}$ and take their average as the stochastic estimator of $\nabla f_{i_k}$. Moreover, it needs to be mentioned that the way RGEM initializes the gradients, i.e., setting them to zero, is very important for stochastic optimization, since it is usually impossible to compute the exact gradients of expectation functions, even at the initial point.
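The mini-batch estimator described above can be sketched as follows; the noisy oracle here is a hypothetical Gaussian-noise model chosen only to illustrate the variance reduction from averaging a batch (assumptions (2.27)-(2.28) hold for it by construction):

```python
import numpy as np

# Mini-batch stochastic gradient estimator: the activated agent draws B
# samples from its SFO oracle (here: true gradient plus Gaussian noise)
# and averages them, shrinking the variance bound sigma^2 to sigma^2 / B.
rng = np.random.default_rng(4)
n, sigma = 3, 2.0
true_grad = np.array([1.0, -2.0, 0.5])   # illustrative stand-in for grad f_i(x)

def sfo_oracle():
    # unbiased with bounded variance, as in (2.27)-(2.28)
    return true_grad + sigma * rng.standard_normal(n)

def minibatch_estimator(B):
    return np.mean([sfo_oracle() for _ in range(B)], axis=0)
```

In Algorithm 2 the batch sizes are prescribed by the stepsize policy of Theorem 2.2; here they are free parameters of the sketch.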
Under the standard assumptions in (2.27) and (2.28) for stochastic optimization, and with proper choices of the algorithmic parameters, Theorem 2.2 shows that RGEM can achieve the optimal rate of convergence (up to a certain logarithmic factor) for solving strongly convex problems given in the form of (1.5), in terms of the number of stochastic gradients of the components $f_i$. The proof of this result can be found in Subsection 4.2.
Let $x^*$ be an optimal solution of (1.5), let the iterates be generated by Algorithm 2, and suppose that $\sigma_0$ and $\sigma$ are as defined in (2.20) and (2.28), respectively. Given the iteration limit $N$, let the parameters be set to (2.21) with the constant chosen as in (2.22), and we also set
where the expectation is taken w.r.t. the random block selections $\{i_k\}$ and the oracle outputs, and
Furthermore, in view of (2.31) this iteration complexity bound can be improved to
in terms of finding a point $\bar x \in X$ s.t. $\mathbb{E}[\psi(\bar x) - \psi^*] \le \epsilon$. Therefore, the corresponding number of stochastic gradient evaluations performed by RGEM for solving problem (1.5) can be bounded by
which, together with (2.33), implies that the total number of required stochastic gradients, or samples of the random variables $\xi_i$, $i = 1, \dots, m$, can be bounded by
Observe that this bound does not depend on the number of terms $m$ for small enough $\epsilon$. To the best of our knowledge, this is the first time that such a convergence result has been established for RIG algorithms applied to distributed stochastic finite-sum problems. This complexity bound is in fact in the same order of magnitude (up to a logarithmic factor) as the complexity bound achieved by the optimal accelerated stochastic approximation methods [11, 12, 19], which uniformly sample all the random variables $\xi_i$. However, the latter approach would involve much higher communication costs in the distributed setting (see Subsection 2.3 for more discussion).
2.3 RGEM for distributed optimization and machine learning
This subsection is devoted to the implementation of RGEM (see Algorithm 1 and Algorithm 2) from two different perspectives under a distributed setting, i.e., that of the server and that of the activated agent. We also discuss the communication costs incurred by RGEM under this setting.
Both the server and the agents in the distributed network start with the same global initial point $x^0$, and the server also initializes its stored gradient information to zero. During the process of RGEM, the server updates the iterate and calculates the output solution (cf. (2.17)), given as a weighted sum of the iterates. Each agent only stores its local variable and updates it according to the information received from the server when activated. The activated agent also needs to upload the change of its gradient to the server. Observe that since this gradient difference might be sparse, uploading it will incur a smaller amount of communication cost than uploading the new gradient itself. Note that line 5 of RGEM from the agent's perspective is optional if the agent saves the historical gradient information from its last update.
We now add some remarks about the potential benefits of RGEM for distributed optimization and machine learning. Firstly, since RGEM does not require any exact gradient evaluation of $f$, it does not need to wait for responses from all agents in order to compute an exact gradient. Each iteration of RGEM involves only communication between the server and the activated $i_k$-th agent. In fact, RGEM will move to the next iteration in case no response is received from the $i_k$-th agent. This scheme works under the assumption that every agent is equally likely to be responsive or available at any given point in time. However, all other optimal RIG algorithms, except RPDG, need the exact gradient information from all network agents once in a while, which incurs high communication costs and synchronization delays as soon as one agent is not responsive. Even RPDG requires a full round of communication and synchronization at the initial point.
Secondly, since each iteration of RGEM involves only a constant number of communication rounds between the server and one selected agent, the communication complexity of RGEM under the distributed setting can be bounded by
\[
\mathcal{O}\Big\{\big(m + \sqrt{m \hat L/\mu}\,\big)\log(1/\epsilon)\Big\}.
\]
Therefore, it can save up to $\mathcal{O}(\sqrt m)$ rounds of communication compared to the optimal deterministic first-order methods.
For solving distributed stochastic finite-sum optimization problems (1.5), RGEM from the $i$-th agent's perspective will be slightly modified as follows.
Similar to the case for the deterministic finite-sum optimization, the total number of communication rounds performed by the above RGEM can be bounded by
for solving (1.5). Each round of communication only involves the server and a randomly selected agent. This communication complexity seems to be optimal, since it matches the lower complexity bound (1.7) established for RIG methods. Moreover, the sampling complexity, i.e., the total number of samples to be collected by all the agents, is also nearly optimal and comparable to the case when all these samples are collected in a centralized location and processed by an optimal stochastic approximation method. On the other hand, if one applied an existing optimal stochastic approximation method to solve the distributed stochastic optimization problem, the communication complexity would be as high as $\mathcal{O}(1/\epsilon)$, which is much worse than that of RGEM.
3 Gradient extrapolation method: dual of Nesterov’s acceleration
Our goal in this section is to introduce a new algorithmic framework, referred to as the gradient extrapolation method (GEM), for solving the convex optimization problem given by
\[
\psi^* := \min_{x \in X}\big\{\psi(x) := f(x) + \mu\,\omega(x)\big\}. \tag{3.1}
\]
We show that GEM can be viewed as a dual of Nesterov's accelerated gradient method, although these two algorithms appear to be quite different. Moreover, GEM possesses some nice properties which enable us to develop and analyze the random gradient extrapolation method for distributed and stochastic optimization.
3.1 Generalized Bregman distance
Note that whenever $\omega$ is non-differentiable, we need to specify a particular selection of the subgradient $\omega'$ before performing the prox-mapping. We assume throughout this paper that such a selection of $\omega'$ is defined recursively as follows. Denote the output of the prox-mapping by $x^1$. By the optimality condition of (1.12), we have
\[
g + \mu\,\omega'(x^1) + \eta\big(\omega'(x^1) - \omega'(x^0)\big) \in -N_X(x^1),
\]
where $N_X(x^1)$ denotes the normal cone of $X$ at $x^1$. Once such an $\omega'(x^1)$ satisfying the above relation is identified, we will use it as the subgradient when defining the prox-function in the next iteration. Note that such a subgradient can be identified as long as $x^1$ is obtained, since it satisfies the optimality condition of (1.12).
Let be a closed convex set and a point be given. Also let be a convex function and
for some . Assume that the function satisfies
for some . Also assume that the scalars and are chosen such that . If
then for any , we have
3.2 The algorithm
As shown in Algorithm 3, GEM starts with a gradient extrapolation step (3.2) to compute the extrapolated gradient from the two previous gradients. Based on it, GEM performs a proximal gradient descent step in (3.3) and updates the output solution. Finally, the gradient at the new point is computed for the gradient extrapolation in the next iteration. This algorithm is a special case of RGEM in Algorithm 1 (with $m = 1$).
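The three steps above can be sketched in the Euclidean setting as follows. This is only an illustration on a quadratic instance with ad hoc constant parameters (the extrapolation weight and stepsize below are assumptions, not the optimal policy analyzed later):

```python
import numpy as np

# Euclidean sketch of GEM: extrapolate in the dual (gradient) space, then
# take a proximal step. Problem: min f(x) + mu/2 ||x||^2 with
# f(x) = 1/2 x^T Q x - c^T x, so the minimizer solves (Q + mu*I) x = c.
rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((6, n))
Q = A.T @ A / 6                                   # Hessian of the smooth part f
c = rng.standard_normal(n)
mu = 0.1                                          # modulus of the strongly convex term

def grad_f(x):
    return Q @ x - c

x_star = np.linalg.solve(Q + mu * np.eye(n), c)   # reference solution

L = np.linalg.norm(Q, 2)
eta, alpha = 2.0 * L, 0.5                         # ad hoc illustrative parameters
x = np.zeros(n)
g_prev = g = grad_f(x)
for k in range(2000):
    g_tilde = g + alpha * (g - g_prev)            # gradient extrapolation step
    x = (eta * x - g_tilde) / (mu + eta)          # Euclidean proximal step
    g_prev, g = g, grad_f(x)
```

Note that, unlike Nesterov's method, no extrapolation is performed on the iterates $x$ themselves, so every point at which a gradient is evaluated stays feasible.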
We now show that GEM can be viewed as the dual of the well-known Nesterov's accelerated gradient (NAG) method. To see this relationship, we will first rewrite GEM in a primal-dual form. Let us consider the dual space where the gradients of $f$ reside, and equip it with the conjugate norm $\|\cdot\|_*$. Let $J_f$ be the conjugate function of $f$, so that $f(x) = \max_{y}\{\langle x, y \rangle - J_f(y)\}$. We can reformulate the original problem in (3.1) as the following saddle point problem:
\[
\min_{x \in X}\Big\{\mu\,\omega(x) + \max_{y}\big[\langle x, y \rangle - J_f(y)\big]\Big\}.
\]
It is clear that $J_f$ is strongly convex with modulus $1/L_f$ w.r.t. $\|\cdot\|_*$ (see standard references on convex analysis for details). Therefore, we can define its associated dual generalized Bregman distance and dual prox-mappings as
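The strong convexity of the conjugate claimed above can be checked numerically on a one-dimensional example: for $f(t) = \tfrac{L}{2}t^2$ (Lipschitz gradient constant $L$), the conjugate is $f^*(y) = \tfrac{1}{2L}y^2$, which is strongly convex with modulus exactly $1/L$. The brute-force maximization below is only an illustrative verification of this identity:

```python
import numpy as np

# Verify f*(y) = y^2 / (2L) for f(t) = (L/2) t^2 by brute-force maximization
# of t*y - f(t) over a fine grid (grid search stands in for exact maximization).
L = 4.0

def f(t):
    return 0.5 * L * t * t

def conjugate(y, grid):
    return max(t * y - f(t) for t in grid)

grid = np.linspace(-10.0, 10.0, 20001)   # step 0.001, covers the maximizer y/L
for y in (-3.0, 0.5, 2.0):
    assert abs(conjugate(y, grid) - y * y / (2 * L)) < 1e-4
```

Smoothness of $f$ thus translates into strong convexity of $J_f$ in the dual, which is what lets the dual-space analysis of GEM mirror the primal-space analysis of NAG.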