 # Random gradient extrapolation for distributed and stochastic optimization

In this paper, we consider a class of finite-sum convex optimization problems defined over a distributed multiagent network with m agents connected to a central server. In particular, the objective function consists of the average of m (> 1) smooth components associated with each network agent together with a strongly convex term. Our major contribution is to develop a new randomized incremental gradient algorithm, namely the random gradient extrapolation method (RGEM), which does not require any exact gradient evaluation even for the initial point, but can achieve the optimal O(log(1/ϵ)) complexity bound in terms of the total number of gradient evaluations of component functions to solve the finite-sum problems. Furthermore, we demonstrate that for stochastic finite-sum optimization problems, RGEM maintains the optimal O(1/ϵ) complexity (up to a certain logarithmic factor) in terms of the number of stochastic gradient computations, but attains an O(log(1/ϵ)) complexity in terms of communication rounds (each round involves only one agent). It is worth noting that the former bound is independent of the number of agents m, while the latter one only linearly depends on m, or even √m for ill-conditioned problems. To the best of our knowledge, this is the first time that these complexity bounds have been obtained for distributed and stochastic optimization problems. Moreover, our algorithms were developed based on a novel dual perspective of Nesterov's accelerated gradient method.


## 1 Introduction

The main problem of interest in this paper is the finite-sum convex programming (CP) problem given in the form of

 ψ∗ := min_{x∈X} {ψ(x) := (1/m)∑_{i=1}^m f_i(x) + μw(x)}. (1.1)

Here, X ⊆ ℝⁿ is a closed convex set, f_i: X → ℝ, i = 1, …, m, are smooth convex functions with Lipschitz continuous gradients over X, i.e., there exist L_i ≥ 0 such that

 ∥∇fi(x1)−∇fi(x2)∥∗≤Li∥x1−x2∥,  ∀x1,x2∈X, (1.2)

and w: X → ℝ is a strongly convex function with modulus 1 w.r.t. a norm ∥·∥, i.e.,

 w(x1)−w(x2)−⟨w′(x2),x1−x2⟩≥12∥x1−x2∥2,  ∀x1,x2∈X, (1.3)

where w′(x_2) denotes any subgradient (or gradient) of w at x_2 and μ ≥ 0 is a given constant. Hence, the objective function ψ is strongly convex whenever μ > 0. For notational convenience, we also denote f(x) := (1/m)∑_{i=1}^m f_i(x), L := (1/m)∑_{i=1}^m L_i, and L̂ := max_{i=1,…,m} L_i. It is easy to see that for some L_f ≤ L,

 ∥∇f(x1)−∇f(x2)∥∗≤Lf∥x1−x2∥≤L∥x1−x2∥,  ∀x1,x2∈X. (1.4)

We also consider a class of stochastic finite-sum optimization problems given by

 ψ∗ := min_{x∈X} {ψ(x) := (1/m)∑_{i=1}^m E_{ξ_i}[F_i(x, ξ_i)] + μw(x)}, (1.5)

where ξ_i's are random variables with support Ξ_i ⊆ ℝ^d. It can be easily seen that (1.5) is a special case of (1.1) with f_i(x) = E_{ξ_i}[F_i(x, ξ_i)], i = 1, …, m. However, different from deterministic finite-sum optimization problems, only noisy gradient information of each component function f_i can be accessed for the stochastic finite-sum optimization problem in (1.5).

The deterministic finite-sum problem (1.1) can model empirical risk minimization in machine learning and statistical inference, and hence has become the subject of intensive studies during the past few years. Our study of finite-sum problems (1.1) and (1.5) has also been motivated by the emerging need for distributed optimization and machine learning. Under such settings, each component function f_i is associated with an agent i, i = 1, …, m, and the agents are connected through a distributed network. While different topologies can be considered for distributed optimization (see, e.g., Figures 1 and 2), in this paper we focus on the star network where m agents are connected to one central server, and all agents only communicate with the server (see Figure 1). These types of distributed optimization problems have several unique features. Firstly, they allow for data privacy, since no local data is stored on the server. Secondly, network agents behave independently and may not be responsive at the same time. Thirdly, the communication between the server and agents can be expensive and have high latency. Finally, by considering the stochastic finite-sum optimization problem, we are interested not only in the deterministic empirical risk minimization, but also in the generalization risk for distributed machine learning. Moreover, we allow the private data for each agent to be collected in an online (streaming) fashion. One typical example of the aforementioned distributed problems is Federated Learning, recently introduced by Google. As a particular example, in the

ℓ₂-regularized logistic regression problem, we have

 f_i(x) = l_i(x) := (1/N_i)∑_{j=1}^{N_i} log(1 + exp(−b_{ij} a_{ij}^T x)), i = 1, …, m,  w(x) = R(x) := (1/2)∥x∥₂²,

provided that l_i is the loss function of agent i with training data {(a_{ij}, b_{ij})}_{j=1}^{N_i} ⊆ ℝⁿ × {−1, 1}, and μ is the penalty parameter. For minimization of the generalized risk, the f_i's are given in the form of expectation, i.e.,

 fi(x)=li(x):=Eξi[log(1+exp(−ξTix))], i=1,…,m,

where the random variable ξ_i models the underlying distribution of the training data of agent i.

Figure 1: A distributed network with 5 agents and one server
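The empirical loss above can be sketched numerically as follows; the agent data `a`, `b` and the penalty `mu` are illustrative placeholders, and `logaddexp` is used so that log(1 + exp(z)) stays stable for large z:

```python
import numpy as np

def logistic_loss_agent(x, a, b, mu=0.1):
    # l_i(x) + mu * R(x) for one agent: a is the (N_i, n) feature matrix and
    # b the labels in {-1, +1}; both stand in for the agent's private data.
    z = -b * (a @ x)
    # log(1 + exp(z)) evaluated stably as logaddexp(0, z)
    return np.mean(np.logaddexp(0.0, z)) + mu * 0.5 * np.dot(x, x)

def logistic_grad_agent(x, a, b, mu=0.1):
    # Gradient of the same objective; sigmoid(z) = d/dz log(1 + exp(z))
    z = -b * (a @ x)
    s = 1.0 / (1.0 + np.exp(-z))
    return a.T @ (-b * s) / len(b) + mu * x
```

At x = 0 every example contributes log 2 to the empirical loss, which gives a quick sanity check.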

Note that another type of topology for distributed optimization is the multi-agent network without a central server, namely the decentralized setting, as shown in Figure 2, where the agents can only communicate with their neighbors to update information; please refer to [21, 32, 23] and references therein for decentralized algorithms.

During the past few years, randomized incremental gradient (RIG) methods have emerged as an important class of first-order methods for finite-sum optimization (e.g., [4, 16, 35, 8, 29, 22, 1, 14, 24]). For solving nonsmooth finite-sum problems, Nemirovski et al. [26, 27] showed that stochastic subgradient (mirror) descent methods can possibly save up to a factor of O(√m) subgradient evaluations. By utilizing the smoothness properties of the objective, Lan showed that one can separate the impact of variance from other deterministic components for stochastic gradient descent, and presented a new class of accelerated stochastic gradient descent methods to further improve these complexity bounds. However, the overall rate of convergence of these stochastic methods is still sublinear even for smooth and strongly convex finite-sum problems (see [11, 12]). Inspired by these works and the success of the incremental aggregated gradient method by Blatt et al., Schmidt et al. presented a stochastic average gradient (SAG) method, which uses randomized sampling of f_i to update the gradients, and can achieve a linear rate of convergence, i.e., an O{(m + L/μ) log(1/ϵ)} complexity bound, to solve unconstrained finite-sum problems (1.1). Johnson and Zhang later presented a stochastic variance reduced gradient (SVRG) method, which computes an estimator of ∇f by iteratively updating the gradient of one randomly selected component f_i of the current exact gradient information and re-evaluating the exact gradient from time to time. Xiao and Zhang later extended SVRG to solve proximal finite-sum problems (1.1). All these methods exhibit an improved complexity bound, and Defazio et al. also presented an improved SAG method, called SAGA, that can achieve such a complexity result. Compared with the class of stochastic dual methods (e.g., [31, 30, 36]), each iteration of the RIG methods only involves the computation of ∇f_i, rather than solving a more complicated subproblem

 argmin{⟨g,y⟩+f∗i(y)+∥y∥2∗},

which may not have explicit solutions.

Noting that most of these RIG methods are not optimal even for m = 1, much recent research effort has been directed to the acceleration of RIG methods. In 2015, Lan and Zhou proposed a RIG method, namely the randomized primal-dual gradient (RPDG) method, and showed that its total number of gradient computations of f_i can be bounded by

 O{(m + √(mL/μ)) log(1/ϵ)}. (1.6)

The RPDG method utilizes a direct acceleration without even using the concept of variance reduction, evolving from the randomized primal-dual methods developed in [36, 7] for solving saddle-point problems. Lan and Zhou also established a lower complexity bound for RIG methods by showing that the number of gradient evaluations of ∇f_i required by any RIG method to find an ϵ-solution of (1.1), i.e., a point x̄ ∈ X s.t. E[∥x̄ − x∗∥²] ≤ ϵ, cannot be smaller than

 Ω((m + √(mL/μ)) log(1/ϵ)), (1.7)

whenever the dimension

 n ≥ (k + m/2)/log(1/q),

where k is the total number of iterations and q ∈ (0, 1) is a constant depending on m, L, and μ. Simultaneously, Lin et al. presented a catalyst scheme which utilizes a restarting technique to accelerate the SAG method (or other "non-accelerated" first-order methods) and thus can possibly improve the complexity bounds obtained by SVRG and SAGA to (1.6) (under the Euclidean setting). Allen-Zhu later showed that one can also directly accelerate SVRG to achieve the optimal rate of convergence (1.6). All these accelerated RIG methods can save up to a factor of O(√m) in the number of gradient evaluations of ∇f_i compared with optimal deterministic first-order methods when L/μ ≥ m.

It should be noted that most existing RIG methods were inspired by empirical risk minimization on a single server (or cluster) in machine learning rather than on a set of agents distributed over a network. Under the distributed setting, methods requiring full gradient computation and/or restarting from time to time may incur extra communication and synchronization costs. As a consequence, methods which require fewer full gradient computations (e.g., SAG, SAGA and RPDG) seem to be more advantageous in this regard. An interesting but yet unresolved question in stochastic optimization is whether there exists a method which does not require the computation of any full gradients (even at the initial point), but can still achieve the optimal rate of convergence in (1.6). Moreover, little attention in the study of RIG methods has been paid to the stochastic finite-sum problem in (1.5), which is important for generalization risk minimization in machine learning. Very recently, there has been some progress on stochastic primal-dual type methods for solving problem (1.5). For example, Lan, Lee and Zhou proposed a stochastic decentralized communication sliding method that can achieve the optimal O(1/ϵ) sampling complexity and best-known complexity bounds for communication rounds for solving stochastic decentralized strongly convex problems. For the distributed setting with a central server, by using a mini-batch technique to collect gradient information and any stochastic gradient based algorithm as a black box to update iterates, Dekel et al. presented a distributed mini-batch algorithm that can obtain an O(1/ϵ) sampling complexity (i.e., number of stochastic gradients) for stochastic strongly convex problems, and hence a corresponding bound for the communication complexity. An asynchronous version was later proposed by Feyzmahdavian et al. that maintains the above convergence rate for regularized stochastic strongly convex problems.
It should be pointed out that these mini-batch based distributed algorithms require sampling from all network agents iteratively, and hence lead to at least an O(1/ϵ) rate of convergence in terms of communication costs between the server and agents. It is unknown whether there exists an algorithm which requires only a significantly smaller number of communication rounds (e.g., O(log(1/ϵ))), but can achieve the optimal O(1/ϵ) sampling complexity for solving the stochastic finite-sum problem in (1.5).

In this paper, we aim to answer these questions in the affirmative by developing RGEM. In particular, we show that for solving the stochastic finite-sum problem (1.5), the total number of stochastic gradients required by RGEM can be bounded by

 Õ{(σ0²/m + σ²)/(μ²ϵ) + (μ∥x^0 − x∗∥₂² + ψ(x^0) − ψ∗)/(μϵ)}, (1.8)

for finding a point x̄ ∈ X s.t. E[ψ(x̄) − ψ∗] ≤ ϵ. Moreover, by utilizing the mini-batch technique, RGEM can achieve an

 O{(m + √(mL̂/μ)) log(1/ϵ)}, (1.9)

complexity bound in terms of the number of communication rounds, and each round only involves the communication between the server and a randomly selected agent. This bound seems to be optimal, since it matches the lower complexity bound for RIG methods to solve deterministic finite-sum problems. It is worth noting that the former bound (1.8) is independent of the number of agents m, while the latter one (1.9) only linearly depends on m, or even √m for ill-conditioned problems. To the best of our knowledge, this is the first time that such a RIG type method has been developed for solving stochastic finite-sum problems (1.5) that can achieve the optimal communication complexity and nearly optimal (up to a logarithmic factor) sampling complexity in the literature.

RGEM is developed based on a novel algorithmic framework, namely the gradient extrapolation method (GEM), that we introduce in this paper for solving black-box convex optimization (i.e., m = 1). The development of GEM was inspired by our recent studies on the relation between accelerated gradient methods and primal-dual gradient methods. In particular, it is observed that Nesterov's accelerated gradient method is a special primal-dual gradient (PDG) method where the extrapolation step is performed in the primal space. Such a primal extrapolation step, however, might result in a search point outside the feasible region under the randomized setting in the RPDG method mentioned above. In view of this deficiency of the PDG and RPDG methods, we propose to switch the primal and dual spaces of primal-dual gradient methods, and to perform the extrapolation step in the dual (gradient) space. The resulting new first-order method, i.e., GEM, can be viewed as a dual version of Nesterov's accelerated gradient method, and we show that it can also achieve the optimal rate of convergence for black-box convex optimization.

This paper is organized as follows. In Section 2 we present the proposed random gradient extrapolation methods (RGEM), and their convergence properties for solving (1.1) and (1.5). In order to provide more insights into the design of the algorithmic scheme of RGEM, we provide an introduction to the gradient extrapolation method (GEM) and its relation to the primal-dual gradient method, as well as Nesterov’s method in Section 3. Section 4 is devoted to the convergence analysis of RGEM. Some concluding remarks are made in Section 5.

### 1.1 Notation and terminology

We use ∥·∥ to denote a general norm in ℝⁿ without specific mention. We also use ∥·∥∗ to denote the conjugate norm of ∥·∥. For any p ≥ 1, ∥·∥_p denotes the standard p-norm in ℝⁿ, i.e., ∥x∥_p^p = ∑_{i=1}^n |x_i|^p for any x ∈ ℝⁿ. For any convex function h, ∂h(x) is the subdifferential of h at x. For a given strongly convex function w with modulus 1 (see (1.3)), we define a prox-function associated with w as

 P(x0,x)≡Pw(x0,x):=w(x)−[w(x0)+⟨w′(x0),x−x0⟩], (1.10)

where w′(x^0) ∈ ∂w(x^0) is an arbitrary subgradient of w at x^0. By the strong convexity of w, we have

 P(x0,x)≥12∥x−x0∥2,  ∀x,x0∈X. (1.11)

It should be pointed out that the prox-function P(·, ·) described above is a generalized Bregman distance in the sense that w is not necessarily differentiable. This is different from the standard definition of the Bregman distance [5, 2, 3, 17, 6]. Throughout this paper, we assume that the prox-mapping associated with X and w, given by

 MX(g,x0,η):=argminx∈X{⟨g,x⟩+μw(x)+ηP(x0,x)}, (1.12)

is easily computable for any x^0 ∈ X, g ∈ ℝⁿ, μ ≥ 0, and η > 0. For any real number r, ⌈r⌉ and ⌊r⌋ denote the nearest integers to r from above and below, respectively. ℝ₊ and ℝ₊₊, respectively, denote the sets of nonnegative and positive real numbers.
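For the common Euclidean setup w(x) = ½∥x∥₂², the prox-mapping (1.12) admits a closed form, which the following sketch implements; the `proj` argument is an assumed stand-in for the Euclidean projection onto a concrete feasible set X (identity for X = ℝⁿ):

```python
import numpy as np

def prox_mapping(g, x0, mu, eta, proj=lambda x: x):
    # Prox-mapping (1.12) with w(x) = 0.5*||x||_2^2, so P(x0, x) = 0.5*||x - x0||^2.
    # The objective <g, x> + mu*0.5*||x||^2 + eta*0.5*||x - x0||^2 equals
    # 0.5*(mu + eta)*||x - (eta*x0 - g)/(mu + eta)||^2 plus a constant, so the
    # constrained minimizer is the Euclidean projection of that point onto X.
    return proj((eta * x0 - g) / (mu + eta))
```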

## 2 Algorithms and main results

This section contains three subsections. We first present in Subsection 2.1 an optimal random gradient extrapolation method (RGEM) for solving the distributed finite-sum problem in (1.1), and then discuss in Subsection 2.2, a stochastic version of RGEM for solving the stochastic finite-sum problem in (1.5). Subsection 2.3 is devoted to the implementation of RGEM in a distributed setting and the discussion about its communication complexity.

### 2.1 RGEM for deterministic finite-sum optimization

The basic scheme of RGEM is formally stated in Algorithm 1. This algorithm simply initializes the gradients as y_i^0 = 0, i = 1, …, m. At each iteration, RGEM requires the new gradient information of only one randomly selected component function f_{i_t}, but maintains m pairs of search points and gradients (x̲_i^t, y_i^t), i = 1, …, m, which are stored by their corresponding agents in the distributed network. More specifically, it first performs a gradient extrapolation step in (2.13) and the primal proximal mapping in (2.14). Then a randomly selected block is updated in (2.15) and the corresponding component gradient is computed in (2.16). As can be seen from Algorithm 1, RGEM does not require any exact gradient evaluations.

Note that the computation of x^t in (2.14) requires an involved computation of (1/m)∑_{i=1}^m ỹ_i^t. In order to save computational time when implementing this algorithm, we suggest to compute this quantity in a recursive manner as follows. Let us denote g^t := (1/m)∑_{i=1}^m y_i^t. Clearly, in view of the fact that y_i^t = y_i^{t−1} for all i ≠ i_t, we have

 g^t = g^{t−1} + (1/m)(y_{i_t}^t − y_{i_t}^{t−1}). (2.18)

Also, by the definition of and (2.13), we have

 (1/m)∑_{i=1}^m ỹ_i^t = (1/m)∑_{i=1}^m y_i^{t−1} + (α_t/m)(y_{i_{t−1}}^{t−1} − y_{i_{t−1}}^{t−2}) = g^{t−1} + (α_t/m)(y_{i_{t−1}}^{t−1} − y_{i_{t−1}}^{t−2}). (2.19)

Using the two ideas mentioned above, we can compute (1/m)∑_{i=1}^m ỹ_i^t in two steps: i) initialize g^0 = 0, and update g^t as in (2.18) after the gradient evaluation step (2.16); ii) replace (2.13) by (2.19) to compute (1/m)∑_{i=1}^m ỹ_i^t. Also note that the difference y_{i_t}^t − y_{i_t}^{t−1} can be saved as it is used in both (2.18) and (2.19) for the next iteration. These enhancements will be incorporated into the distributed setting in Subsection 2.3 to possibly save communication costs.

It is also interesting to observe the differences between RGEM and RPDG. RGEM has only one extrapolation step (2.13), which combines two types of predictions. One is to predict future gradients using historic data, and the other is to obtain an estimator of the current exact gradient of f from the randomly updated gradient information of f_i. The RPDG method, however, needs two extrapolation steps in both the primal and dual spaces. Due to the existence of the primal extrapolation step, RPDG cannot guarantee that the search points where it performs gradient evaluations fall within the feasible set X. Hence, it requires the assumption that the f_i's are differentiable with Lipschitz continuous gradients over ℝⁿ. Such a strong assumption is not required by RGEM, since all the primal iterates generated by RGEM stay within the feasible region X. As a result, RGEM can deal with a much wider class of problems than RPDG. Moreover, RGEM requires no exact gradient computation for initialization, which provides a fully-distributed algorithmic framework under the assumption that there exists σ0 ≥ 0 such that

 (1/m)∑_{i=1}^m ∥∇f_i(x^0)∥∗² ≤ σ0², (2.20)

where x^0 ∈ X is the given initial point.

We now provide a constant step-size policy for RGEM to solve strongly convex problems given in the form of (1.1) and show that the resulting algorithm exhibits an optimal linear rate of convergence in Theorem 2.1. The proof of Theorem 2.1 can be found in Subsection 4.1.

**Theorem 2.1.** Let x∗ be an optimal solution of (1.1), let x^k and x̲^k be defined in (2.14) and (2.17), respectively, and let L̂ := max_{i=1,…,m} L_i. Also let {τ_t}, {η_t}, and {α_t} be set to

 τ_t ≡ τ = 1/(m(1−α)) − 1,   η_t ≡ η = μα/(1−α),   and   α_t ≡ mα. (2.21)

If (2.20) holds and is set as

 α = 1 − 1/(m + √(m² + 16mL̂/μ)), (2.22)

then

 E[P(x^k, x∗)] ≤ 2α^k Δ_{0,σ0}/μ, (2.23)

 E[ψ(x̲^k) − ψ(x∗)] ≤ 6 max{m, L̂/μ} Δ_{0,σ0} α^{k/2}, (2.24)

where

 Δ_{0,σ0} := μP(x^0, x∗) + ψ(x^0) − ψ∗ + σ0²/(mμ). (2.25)

In view of Theorem 2.1, we can provide bounds on the total number of gradient evaluations performed by RGEM to find a stochastic ϵ-solution of problem (1.1), i.e., a point x̄ ∈ X s.t. E[ψ(x̄) − ψ∗] ≤ ϵ. Theorem 2.1 implies that the number of gradient evaluations of f_i performed by RGEM to find such a solution can be bounded by

 K(ϵ, C, σ0²) = 2(m + √(m² + 16mC)) log(6 max{m, C} Δ_{0,σ0}/ϵ) = O{(m + √(mL̂/μ)) log(1/ϵ)}. (2.26)

Here C := L̂/μ. Therefore, whenever √(mC) is dominating, and L_f and L̂ are in the same order of magnitude, RGEM can save up to O(√m) gradient evaluations of the component functions f_i compared with the optimal deterministic first-order methods. More specifically, RGEM does not require any exact gradient computation, and its communication cost per iteration is similar to that of pure stochastic gradient descent. To the best of our knowledge, this is the first time that such an optimal RIG method has been presented for solving (1.1) in the literature. It should be pointed out that while the rates of convergence of RGEM obtained in Theorem 2.1 are stated in terms of expectation, one can develop large-deviation results for these rates of convergence using similar techniques for solving strongly convex problems.
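As a quick numeric illustration of the step-size rule (2.22) and the leading factor of the bound (2.26), with arbitrary sample values for m, L̂, and μ:

```python
import math

# m, L_hat, mu are arbitrary illustrative numbers, not taken from the paper.
m, L_hat, mu = 100, 10.0, 0.1
C = L_hat / mu                                          # condition-number ratio
alpha = 1.0 - 1.0 / (m + math.sqrt(m * m + 16 * m * C))  # rule (2.22)
per_log_factor = 2 * (m + math.sqrt(m * m + 16 * m * C))  # multiplies log(.) in (2.26)
```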

Furthermore, if a one-time exact gradient evaluation is available at the initial point, i.e., y_i^0 = ∇f_i(x^0), i = 1, …, m, we can drop the assumption in (2.20) and employ a more aggressive stepsize policy with

 α = 1 − 2/(m + √(m² + 8mL̂/μ)).

Similarly, we can demonstrate that the number of gradient evaluations of f_i performed by RGEM with this initialization method to find a stochastic ϵ-solution can be bounded by

 (m + √(m² + 8mC)) log(6 max{m, C} Δ_{0,0}/ϵ) + m = O{(m + √(mL̂/μ)) log(1/ϵ)}.

### 2.2 RGEM for stochastic finite-sum optimization

We discuss in this subsection the stochastic finite-sum optimization and online learning problems, where only noisy gradient information of f_i can be accessed via a stochastic first-order (SFO) oracle. In particular, for any given point x̲_i^t ∈ X, the SFO oracle outputs a vector G_i(x̲_i^t, ξ_i^t) s.t.

 E_ξ[G_i(x̲_i^t, ξ_i^t)] = ∇f_i(x̲_i^t), i = 1, …, m, (2.27)

 E_ξ[∥G_i(x̲_i^t, ξ_i^t) − ∇f_i(x̲_i^t)∥∗²] ≤ σ², i = 1, …, m. (2.28)

We also assume throughout this subsection that the norm ∥·∥ is associated with the inner product ⟨·, ·⟩.

As shown in Algorithm 2, the RGEM for stochastic finite-sum optimization is naturally obtained by replacing the gradient evaluation of ∇f_{i_t} in Algorithm 1 (see (2.16)) with a stochastic gradient estimator of ∇f_{i_t} given in (2.29). In particular, at each iteration, we collect B_t stochastic gradients of only one randomly selected component f_{i_t} and take their average as the stochastic estimator of ∇f_{i_t}. Moreover, it needs to be mentioned that the way RGEM initializes the gradients, i.e., y^0 = 0, is very important for stochastic optimization, since it is usually impossible to compute exact gradients of expectation functions even at the initial point.
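A minimal sketch of the mini-batch estimator in (2.29), assuming a user-supplied oracle satisfying (2.27)-(2.28); the oracle signature and the way noise is drawn are illustrative assumptions:

```python
import numpy as np

def minibatch_estimator(oracle, x, B, rng):
    # Average B stochastic gradients from one agent's SFO oracle; averaging
    # i.i.d. samples shrinks the variance bound (2.28) from sigma^2 to
    # sigma^2 / B, which is what the schedule for B_t exploits.
    samples = [oracle(x, rng.standard_normal(x.shape)) for _ in range(B)]
    return np.mean(samples, axis=0)
```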

Under the standard assumptions in (2.27) and (2.28) for stochastic optimization, and with proper choices of algorithmic parameters, Theorem 2.2 shows that RGEM can achieve the optimal rate of convergence (up to a certain logarithmic factor) for solving strongly convex problems given in the form of (1.5) in terms of the number of stochastic gradients of f_i. The proof of this result can be found in Subsection 4.2.

**Theorem 2.2.** Let x∗ be an optimal solution of (1.5), let x^k and x̲^k be generated by Algorithm 2, and let L̂ := max_{i=1,…,m} L_i. Suppose that σ0 and σ are defined in (2.20) and (2.28), respectively. Given the iteration limit k, let {τ_t}, {η_t}, and {α_t} be set to (2.21) with α defined in (2.22), and also set

 B_t = ⌈k(1−α)² α^{−t}⌉, t = 1, …, k; (2.30)

then

 E[P(x^k, x∗)] ≤ 2α^k Δ_{0,σ0,σ}/μ, (2.31)

 E[ψ(x̲^k) − ψ(x∗)] ≤ 6 max{m, L̂/μ} Δ_{0,σ0,σ} α^{k/2}, (2.32)

where the expectation is taken w.r.t. {i_t} and {ξ_t}, and

 Δ_{0,σ0,σ} := μP(x^0, x∗) + ψ(x^0) − ψ(x∗) + (σ0²/m + 5σ²)/μ. (2.33)
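The batch-size schedule (2.30) is easy to tabulate; the iteration limit k and step-size parameter α below are arbitrary illustrative values rather than ones derived from (2.22):

```python
import math

def batch_schedule(k, alpha):
    # B_t = ceil(k * (1 - alpha)^2 * alpha^(-t)) for t = 1, ..., k, as in (2.30);
    # the batch sizes grow geometrically in t.
    return [math.ceil(k * (1 - alpha) ** 2 * alpha ** (-t)) for t in range(1, k + 1)]
```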

In view of (2.32), the number of iterations performed by RGEM to find a stochastic ϵ-solution of (1.5) can be bounded by

 K̂(ϵ, C, σ0², σ²) := 2(m + √(m² + 16mC)) log(6 max{m, C} Δ_{0,σ0,σ}/ϵ). (2.34)

Furthermore, in view of (2.31) this iteration complexity bound can be improved to

 K̄(ϵ, α, σ0², σ²) := log_{1/α}(2Δ̃_{0,σ0,σ}/(μϵ)), (2.35)

in terms of finding a point x̄ ∈ X s.t. E[P(x̄, x∗)] ≤ ϵ. Therefore, the corresponding number of stochastic gradient evaluations performed by RGEM for solving problem (1.5) can be bounded by

 ∑_{t=1}^k B_t ≤ k∑_{t=1}^k (1−α)² α^{−t} + k = O{(Δ_{0,σ0,σ}/(μϵ) + m + √(mC)) log(Δ_{0,σ0,σ}/(μϵ))}, (2.36)

which, together with (2.33), implies that the total number of required stochastic gradients or samples of the random variables ξ_i, i = 1, …, m, can be bounded by

 Õ{(σ0²/m + σ²)/(μ²ϵ) + (μP(x^0, x∗) + ψ(x^0) − ψ∗)/(μϵ) + m + √(mL̂/μ)}.

Observe that this bound does not depend on the number of terms m for small enough ϵ. To the best of our knowledge, this is the first time that such a convergence result has been established for RIG algorithms to solve distributed stochastic finite-sum problems. In fact, this complexity bound is in the same order of magnitude (up to a logarithmic factor) as the complexity bound achieved by the optimal accelerated stochastic approximation methods [11, 12, 19], which uniformly sample all the random variables ξ_i, i = 1, …, m. However, this latter approach would involve much higher communication costs in the distributed setting (see Subsection 2.3 for more discussions).

### 2.3 RGEM for distributed optimization and machine learning

This subsection is devoted to RGEM (see Algorithm 1 and Algorithm 2) from two different perspectives, i.e., that of the server and that of the activated agent, under a distributed setting. We also discuss the communication costs incurred by RGEM under this setting.

We now add some remarks about the potential benefits of RGEM for distributed optimization and machine learning. Firstly, since RGEM does not require any exact gradient evaluation of f, it does not need to wait for responses from all agents in order to compute an exact gradient. Each iteration of RGEM only involves communication between the server and the activated i_t-th agent. In fact, RGEM will move to the next iteration in case no response is received from the i_t-th agent. This scheme works under the assumption that the probability for any agent being responsive or available at a certain point of time is equal. However, all other optimal RIG algorithms, except RPDG, need the exact gradient information from all network agents once in a while, which incurs high communication costs and synchronization delays as long as one agent is not responsive. Even RPDG requires a full round of communications and synchronization at the initial point.

Secondly, since each iteration of RGEM involves only a constant number of communication rounds between the server and one selected agent, the communication complexity of RGEM under the distributed setting can be bounded by

 O{(m + √(mL̂/μ)) log(1/ϵ)}.

Therefore, it can save up to a factor of O(√m) communication rounds compared with the optimal deterministic first-order methods.

For solving distributed stochastic finite-sum optimization problems (1.5), RGEM from the i_t-th agent's perspective will be slightly modified as follows.

Similar to the deterministic finite-sum case, the total number of communication rounds performed by the above RGEM can be bounded by

 O{(m + √(mL̂/μ)) log(1/ϵ)},

for solving (1.5). Each round of communication only involves the server and a randomly selected agent. This communication complexity seems to be optimal, since it matches the lower complexity bound (1.7). Moreover, the sampling complexity, i.e., the total number of samples to be collected by all the agents, is also nearly optimal and comparable to the case when all these samples are collected in a centralized location and processed by an optimal stochastic approximation method. On the other hand, if one applies an existing optimal stochastic approximation method to solve the distributed stochastic optimization problem, the communication complexity will be as high as O(1/ϵ), which is much worse than that of RGEM.

## 3 Gradient extrapolation method: dual of Nesterov’s acceleration

Our goal in this section is to introduce a new algorithmic framework, referred to as the gradient extrapolation method (GEM), for solving the convex optimization problem given by

 ψ∗:=minx∈X{ψ(x):=f(x)+μw(x)}. (3.1)

We show that GEM can be viewed as a dual of Nesterov's accelerated gradient method although the two algorithms appear to be quite different. Moreover, GEM possesses some nice properties which enable us to develop and analyze the random gradient extrapolation method for distributed and stochastic optimization.

### 3.1 Generalized Bregman distance

In this subsection, we provide a brief introduction to the generalized Bregman distance defined in (1.10) and some properties of its associated prox-mapping defined in (1.12).

Note that whenever w is non-differentiable, we need to specify a particular selection of the subgradient w′ before performing the prox-mapping. We assume throughout this paper that such a selection of w′ is defined recursively as follows. Denote x^1 := M_X(g, x^0, η). By the optimality condition of (1.12), we have

 g+(μ+η)w′(x1)−ηw′(x0)∈NX(x1),

where N_X(x^1) denotes the normal cone of X at x^1. Once such a subgradient w′(x^1) satisfying the above relation is identified, we will use it as the subgradient when defining P(x^1, x) in the next iteration. Note that such a subgradient can be identified as long as x^1 is obtained, since it satisfies the optimality condition of (1.12).

The following lemma, which generalizes Lemma 6 of  and Lemma 2 of , characterizes the solutions to (1.12). The proof of this result can be found in Lemma 5 of .

Let U be a closed convex set and a point ũ ∈ U be given. Also let w: U → ℝ be a convex function and

 W(~u,u)=w(u)−w(~u)−⟨w′(~u),u−~u⟩

for some w′(ũ) ∈ ∂w(ũ). Assume that the function q: U → ℝ satisfies

 q(u1)−q(u2)−⟨q′(u2),u1−u2⟩≥μ0W(u2,u1),  ∀u1,u2∈U

for some μ0 ≥ 0. Also assume that the scalars μ1 and μ2 are chosen such that μ0 + μ1 + μ2 ≥ 0. If

 u∗∈Argmin{q(u)+μ1w(u)+μ2W(~u,u):u∈U},

then for any u ∈ U, we have

 q(u∗)+μ1w(u∗)+μ2W(~u,u∗)+(μ0+μ1+μ2)W(u∗,u)≤q(u)+μ1w(u)+μ2W(~u,u).

### 3.2 The algorithm

As shown in Algorithm 3, GEM starts with a gradient extrapolation step (3.2) to compute ỹ^t from the two previous gradients y^{t−1} and y^{t−2}. Based on ỹ^t, it performs a proximal gradient descent step in (3.3) and updates the output solution x̄^t. Finally, the gradient at x̄^t is computed for gradient extrapolation in the next iteration. This algorithm is a special case of RGEM in Algorithm 1 (with m = 1).
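A minimal Euclidean sketch of one possible GEM loop, under illustrative assumptions: X = ℝⁿ, w(x) = ½∥x∥², ad hoc constant parameters, and a guessed weighted-averaging form for the output update. It is not the paper's exact Algorithm 3, only an indication of how the steps fit together.

```python
import numpy as np

def gem(grad_f, x0, mu, eta, tau, alpha, iters):
    # Dual extrapolation, prox step, and gradient evaluation follow the
    # narrative above; the output averaging and constants are assumptions.
    x = xbar = x0
    y_prev = y = grad_f(x0)
    for _ in range(iters):
        y_tilde = y + alpha * (y - y_prev)        # gradient extrapolation (3.2)
        x = (eta * x - y_tilde) / (mu + eta)      # proximal gradient step (3.3), X = R^n
        xbar = (tau * xbar + x) / (1.0 + tau)     # output solution update (assumed form)
        y_prev, y = y, grad_f(xbar)               # new gradient for next round
    return xbar

# Toy run on psi(x) = 0.5*||x||^2 + mu*0.5*||x||^2, whose minimizer is 0.
x_out = gem(lambda x: x, np.ones(2), mu=1.0, eta=5.0, tau=1.0, alpha=0.8, iters=200)
```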

We now show that GEM can be viewed as the dual of the well-known Nesterov's accelerated gradient (NAG) method. To see such a relationship, we will first rewrite GEM in a primal-dual form. Let us consider the dual space G, where the gradients of f reside, and equip it with the conjugate norm ∥·∥∗. Let J_f: G → ℝ be the conjugate function of f such that f(x) = max_{g∈G}{⟨x, g⟩ − J_f(g)}. We can reformulate the original problem in (3.1) as the following saddle point problem:

 ψ∗:=minx∈X{maxg∈G{⟨x,g⟩−Jf(g)}+μw(x)}. (3.6)

It is clear that J_f is strongly convex with modulus 1/L_f w.r.t. ∥·∥∗. Therefore, we can define its associated dual generalized Bregman distance and dual prox-mappings as

 Df(g0,g) :=Jf(g)−[Jf(g0)+⟨J′f(g0),g−g0⟩], (3.7) MG(−~x,g0,τ) :=argming∈G{⟨−~x,g⟩+Jf(g)+τDf(g0,g)}, (3.8)

for any