# Local SGD for Saddle-Point Problems

GANs are among the most popular and widely used neural network models. When the model is large and there is a lot of data, training can become slow. The standard remedy is to use multiple devices, which makes distributed and federated training of GANs an important question. From an optimization point of view, a GAN is nothing more than a classical saddle-point problem: $\min_x \max_y f(x,y)$. This paper therefore focuses on the distributed optimization of smooth stochastic saddle-point problems using Local SGD. We present a new algorithm tailored to this problem — Extra Step Local SGD. The obtained theoretical bounds on the number of communication rounds are $\Omega(K^{2/3}M^{1/3})$ in the strongly-convex–strongly-concave case and $\Omega(K^{8/9}M^{4/9})$ in the convex-concave case (here $M$ is the number of functions (nodes) and $K$ is the number of iterations).


## 1 Introduction

Distributed Learning [20] is one of the main concepts in Machine and Deep Learning [16, 18], and the popularity of this approach continues to grow. The reason is the increase in the size of neural network models and the development of Big Data: modern devices cannot quickly process the volumes of information that industry requires of them. To speed up the learning process, we can use several devices; each device then solves its own subproblem, and the local solutions are merged into a general, global one. The easiest way to do this is to divide the dataset into approximately equal parts and give each part to a different device. This approach works well when the devices are simply powerful identical workers on the same cluster. But the modern world poses a new challenge, in which the workers are users' personal devices. Now the data can differ fundamentally across devices (because it is generated by users), while we still want to solve a global learning problem without forgetting about each client. Here we can mention Federated Learning [12, 8], a very young offshoot of Distributed Learning. As mentioned above, the main idea of Federated Learning is to train the global model with attention to each local client's submodel.

The bottleneck of Distributed Learning is communication: the communication process is much slower and more expensive than a local learning step. One way to address it is to reduce the cost of each communication [2, 4]. Another approach is to decrease the number of communications. In this paper, we look at one of the most famous such concepts — Local SGD [24] (its classic centralized version). The idea of this method is rather simple: each device makes several local steps, training its own model; then the devices send the obtained model parameters to the server, which averages these parameters and sends the result back to the devices, and the procedure repeats. This approach can be considered in two cases [10]: homogeneous data — when each device has almost the same data, and heterogeneous — when the data on the devices differs significantly. It is shown that for homogeneous data the use of Local SGD is more preferable than in the heterogeneous case [23, 22]. But most of the analysis present in the literature is associated exclusively with the minimization problem $\min_x \frac{1}{M}\sum_{m=1}^M f_m(x)$. The purpose of this work is to use Local SGD for saddle-point problems:
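The Local SGD loop described above can be sketched in a few lines of Python. This is a minimal illustration for a minimization problem; all names and the `grad` interface are ours, not from the paper:

```python
import numpy as np

def local_sgd(grad, x0, n_devices, local_steps, rounds, lr):
    """Plain Local SGD sketch: each device runs `local_steps` SGD steps on its
    own copy of the parameters, then the server averages the copies."""
    xs = [x0.copy() for _ in range(n_devices)]
    for _ in range(rounds):
        for m in range(n_devices):
            for _ in range(local_steps):
                xs[m] = xs[m] - lr * grad(m, xs[m])  # local (stochastic) step
        avg = sum(xs) / n_devices                    # communication: average
        xs = [avg.copy() for _ in range(n_devices)]  # broadcast back
    return xs[0]
```

With heterogeneous quadratic objectives $f_m(x) = \tfrac12\|x - c_m\|^2$ the averaged iterate converges to the mean of the $c_m$, which is the minimizer of the global objective.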

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y) := \frac{1}{M}\sum_{m=1}^{M} f_m(x,y). \qquad (1)$$

This problem is also popular in Machine Learning due to the great interest in the adversarial approach to network training: not one model is trained but two, and the main goal of the second model is to deceive the first. The main representative of such neural models is the Generative Adversarial Network [5]. In fact, a GAN is nothing more than a classic saddle-point problem. Therefore, it is important to be able to solve the distributed problem (1), both in the context of neural networks and in other applications in science, for example in economics: matrix games, Nash equilibria, etc. In particular, we note a fairly popular problem that can also be written in the form of a saddle-point problem — the Wasserstein Barycenter [1]. It has many practical applications; for example, it is now widely used in the analysis of brain images.

### 1.1 Related works

In this section we discuss the works on which our contribution is based.

Saddle-point problems. We highlight two main non-distributed algorithms. The first is Mirror Descent [3], which is customarily used in the nonsmooth case. For a smooth problem, one uses a modification of Mirror Descent with an extra step [13] — Mirror Prox [19, 7]. There are modifications of this algorithm that perform the extra step but do not make an additional oracle call for it [6].

Local SGD. The idea of distributed optimization using methods similar to Local SGD is not an achievement of recent years and goes back to [17, 24]. The development of theoretical analysis for these methods can be traced in the following works [14, 21, 10, 23, 22, 11]. One can also note the use of additional techniques aimed at variance reduction [9].

Local Fixed-Point Method. Solving saddle-point problems with Local SGD is not popular in the literature. The work [15] is of interest: the authors give a generalization of Local SGD with an arbitrary operator $T$ for which we are looking for a fixed point. In particular, one can consider an operator of the form $T(z) = z - \gamma G(z)$, where $G$ is some operator. If for $G$ we substitute the gradient of a function, we get exactly the operator of gradient descent; but one can instead use the operator $F$ from (2) and obtain the descent-ascent operator for the saddle-point problem. The main drawback of this work is the rather strong assumption regarding the operator $T$: it is supposed to be firmly nonexpansive. Such a condition is satisfied when $G$ is the gradient of a convex function, but it fails if we substitute the operator of the simplest and most classical saddle-point problem $f(x,y) = xy$ (for more details see Appendix C). In this paper, we want to provide an analysis without such a strong assumption.
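The failure of firm nonexpansiveness for the bilinear problem $f(x,y) = xy$ can be checked numerically: firm nonexpansiveness requires $\|T(z_1)-T(z_2)\|^2 \le \langle T(z_1)-T(z_2),\, z_1-z_2\rangle$, while for the descent-ascent operator the left side exceeds the right for any $\gamma > 0$. This is a quick sanity check of ours, not a reproduction of the paper's Appendix C:

```python
import numpy as np

gamma = 0.5
F = lambda z: np.array([z[1], -z[0]])  # operator (2) for f(x, y) = x * y
T = lambda z: z - gamma * F(z)         # descent-ascent (fixed-point) operator

z1, z2 = np.array([1.0, 0.0]), np.array([0.0, 0.0])
lhs = np.linalg.norm(T(z1) - T(z2)) ** 2     # equals (1 + gamma^2) * ||z1 - z2||^2
rhs = float(np.dot(T(z1) - T(z2), z1 - z2))  # equals ||z1 - z2||^2, since <F d, d> = 0
assert lhs > rhs  # firm nonexpansiveness (and even nonexpansiveness) fails
```

Here $\langle F(d), d\rangle = 0$ for every $d$, so each application of $T$ inflates distances by the factor $\sqrt{1+\gamma^2}$.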

### 1.2 Our contributions

In this paper, we present an extra step modification [13] of the Local SGD algorithm for stochastic smooth strongly-convex–strongly-concave saddle-point problems. We show that its theoretical bound on the number of communication rounds over $K$ local iterations is $\Omega(K^{2/3}M^{1/3})$. Additionally, using a regularization technique, we transfer the results from the strongly-convex–strongly-concave case to the convex-concave one and obtain a bound of $\Omega(K^{8/9}M^{4/9})$.

## 2 Main part

### 2.1 Settings and assumptions

As mentioned above, we consider problem (1), where the sets $\mathcal{X}$ and $\mathcal{Y}$ are convex compact sets. For simplicity, we introduce the set $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, write $z = (x, y)$, and define the operator $F_m$:

$$F_m(z) = F_m(x,y) = \begin{pmatrix} \nabla_x f_m(x,y) \\ -\nabla_y f_m(x,y) \end{pmatrix}. \qquad (2)$$

We do not have access to oracles for $F_m(z)$; at each iteration our oracle gives only some stochastic realisation $F_m(z, \xi)$. Additionally, we introduce the following assumptions:

Assumption 1. $F_m(z)$ is Lipschitz continuous with constant $L$, i.e. for all $z_1, z_2 \in \mathcal{Z}$

$$\|F_m(z_1) - F_m(z_2)\| \le L\|z_1 - z_2\|. \qquad (3)$$

Assumption 2. $f(x,y)$ is $\mu$-strongly-convex–strongly-concave. One can rewrite this in terms of the operator $F$: for all $z_1, z_2 \in \mathcal{Z}$

$$\langle F(z_1) - F(z_2),\, z_1 - z_2\rangle \ge \mu\|z_1 - z_2\|^2. \qquad (4)$$

Assumption 3. $F_m(z,\xi)$ is unbiased and has bounded variance, i.e.

$$\mathbb{E}[F_m(z,\xi)] = F_m(z), \qquad \mathbb{E}[\|F_m(z,\xi) - F_m(z)\|^2] \le \sigma^2. \qquad (5)$$

Assumption 4. The values of the local operators are sufficiently close to the value of the mean operator, i.e. for all $z \in \mathcal{Z}$

$$\|F_m(z) - F(z)\| \le D, \qquad (6)$$

where $F(z) = \frac{1}{M}\sum_{m=1}^M F_m(z)$.

Hereinafter, we use the standard Euclidean norm $\|\cdot\|$. We also introduce the notation $\mathrm{proj}_{\mathcal{Z}}(\cdot)$ for the Euclidean projection onto the set $\mathcal{Z}$.

### 2.2 New algorithm and its convergence

Our algorithm is standard Local SGD with a slight modification, namely an added extra step (see Algorithm 1). The extra step is a standard approach to solving smooth saddle-point problems: first of all, it allows for an optimal theoretical analysis, and in practice it also gives some minor convergence improvements.

Our algorithm also provides the ability to set different communication moments for the variables $x$ and $y$. One can note that this approach suits Federated Learning of GANs well: we can vary the communication frequencies of the generator and the discriminator if one of the models is (almost) the same on all devices while the second requires more frequent averaging.
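Since Algorithm 1 itself is not reproduced in this extract, here is a hedged Python sketch of the scheme the text describes — one extra-step update per iteration and periodic parameter averaging. The projection onto $\mathcal{Z}$ and the separate communication sets for $x$ and $y$ are omitted for simplicity, and all names are ours:

```python
import numpy as np

def extra_step_local_sgd(oracle, z0, M, K, gamma, comm_steps):
    """Sketch of Extra Step Local SGD. `oracle(m, z)` returns a stochastic
    sample of the operator F_m at z; `comm_steps` is the set of iterations
    after which the devices average their iterates (a communication round)."""
    zs = [z0.copy() for _ in range(M)]
    for k in range(K):
        # extra step: z_m^{k+1/2} = z_m^k - gamma * F_m(z_m^k, xi)
        half = [zs[m] - gamma * oracle(m, zs[m]) for m in range(M)]
        # main step queries the oracle at the extrapolated point
        zs = [zs[m] - gamma * oracle(m, half[m]) for m in range(M)]
        if k in comm_steps:            # communication: average and broadcast
            avg = sum(zs) / M
            zs = [avg.copy() for _ in range(M)]
    return sum(zs) / M                 # final averaging over devices
```

On a toy strongly monotone problem with heterogeneous local operators $F_m(z) = z - c_m$, the averaged iterate converges to the solution $z^* = \frac{1}{M}\sum_m c_m$ even with infrequent communication.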

We now present a theoretical analysis of the proposed method. To begin with, we introduce auxiliary sequences that we need only in the theoretical analysis (Algorithm 1 does not compute them):

$$\bar z^k = \frac{1}{M}\sum_{m=1}^M z_m^k, \qquad \bar g^k = \frac{1}{M}\sum_{m=1}^M F_m(z_m^k, \xi_m^k), \qquad \bar z^{k+1/2} = \mathrm{proj}_{\mathcal{Z}}\big(\bar z^k - \gamma \bar g^k\big), \qquad \bar z^{k+1} = \mathrm{proj}_{\mathcal{Z}}\big(\bar z^k - \gamma \bar g^{k+1/2}\big), \qquad (7)$$

where the definitions of $\bar z$ and $\bar g$ apply to the half-integer indices $k + 1/2$ as well.

These sequences are indeed virtual, but one can see that at a communication moment $\bar z^k = z_m^k$ for all $m$ (in the averaged variables). At the last iteration, the algorithm is assumed to communicate in both $x$ and $y$, which means the algorithm's output equals $\bar z^{K+1}$. Therefore, we can carry out the theoretical analysis using these sequences.

###### Theorem 1

Let $z_m^k$ denote the iterates of Algorithm 1 for solving problem (1), and let $\bar z^k$ be defined by (7). Let Assumptions 1–4 be satisfied for each function $f_m$ and their mean $f$. Also let $H$ denote the maximum distance between communication moments (the largest number of local iterations between two consecutive averagings). Then, for a sufficiently small step size $\gamma$, we have the following estimate for the distance to the solution $z^*$:

$$\mathbb{E}\big[\|\bar z^{K+1} - z^*\|^2\big] \le \left(1 - \frac{\mu\gamma}{2}\right)^K \|\bar z^0 - z^*\|^2 + \frac{20\gamma\sigma^2}{\mu M} + \frac{250\gamma^2 H^3 L^2 (2\sigma^2 + D^2)}{\mu^2}. \qquad (8)$$

For the proof of the theorem, see Appendix B. It is also possible to prove the following convergence corollary:

###### Corollary 1

Let $\gamma = \frac{\log\alpha^2}{\mu K}$ for a suitable choice of the parameter $\alpha$; then from (8) we get:

$$\mathbb{E}\big[\|\bar z^{K+1} - z^*\|^2\big] \le \frac{\|\bar z^0 - z^*\|^2 \log^2\alpha^2}{K^2} + \frac{20\sigma^2\log\alpha^2}{\mu^2 MK} + \frac{250 H^3 L^2 \log^2\alpha^2\,(2\sigma^2 + D^2)}{\mu^4 K^2}. \qquad (9)$$

The proof of this fact is quite obvious: it is a simple substitution of $\gamma$ and $\alpha$ into (8), together with the fact that $(1 - \mu\gamma/2)^K \le \exp(-\mu\gamma K/2)$. The estimate (9) can be rewritten without polylogarithmic and constant numerical factors as follows:

$$\tilde{\mathcal{O}}\left(\frac{\|\bar z^0 - z^*\|^2}{K^2} + \frac{\sigma^2}{\mu^2 MK} + \frac{H^3 L^2(\sigma^2 + D^2)}{\mu^4 K^2}\right). \qquad (10)$$

It can be seen that if we take $H = \mathcal{O}(K^{1/3}/M^{1/3})$, we have a convergence rate of about $1/(MK)$. The estimate for the number of communication rounds is then $\Omega(K^{2/3}M^{1/3})$.

As noted earlier, the bottleneck of distributed optimization is the time and cost of communications, so their number (not the total number of iterations) is the main issue. The obtained estimate for the number of communication rounds shows that over $K$ local iterations we communicate only about $K^{2/3}M^{1/3}$ times.
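As an illustration with hypothetical values of $K$ and $M$, the schedule $H = \mathcal{O}(K^{1/3}/M^{1/3})$ thins out communication as follows:

```python
K, M = 10 ** 6, 100                       # iterations and devices (illustrative values only)
H = round(K ** (1 / 3) / M ** (1 / 3))    # local steps between communications
rounds = K // H                           # communication rounds, about K^(2/3) * M^(1/3)
target = K ** (2 / 3) * M ** (1 / 3)
assert abs(rounds - target) / target < 0.05  # tens of thousands of rounds instead of 10^6
```

So with a million local iterations on a hundred devices, the devices synchronize only on the order of $5 \times 10^4$ times.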

Regularization and the convex-concave case. The above estimates can be extended to the case when Assumption 2 is satisfied with $\mu = 0$ — a convex-concave saddle-point problem. For this, the original function is regularized in the following form:

$$g(x,y) = f(x,y) + \frac{\varepsilon}{2R^2}\|x - x_0\|^2 - \frac{\varepsilon}{2R^2}\|y - y_0\|^2,$$

where $\varepsilon$ is the target accuracy of the solution for $f$, and $R$ is the diameter of the optimization set. In this case, the problem for $g$ is solved with accuracy $\varepsilon/2$. Then the following estimate is valid:
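The regularization above can be sketched in code. The sign on the $y$-term follows the standard strongly-convex–strongly-concave construction (the extracted formula is ambiguous on this point), and all names are ours:

```python
import numpy as np

def regularize(f, x0, y0, eps, R):
    """Return g(x, y) = f(x, y) + eps/(2R^2) ||x - x0||^2 - eps/(2R^2) ||y - y0||^2,
    which is (eps/R^2)-strongly-convex in x and (eps/R^2)-strongly-concave in y."""
    c = eps / (2 * R ** 2)
    def g(x, y):
        return f(x, y) + c * float(np.dot(x - x0, x - x0)) - c * float(np.dot(y - y0, y - y0))
    return g
```

Solving the regularized problem, which has $\mu = \varepsilon/R^2$, to accuracy $\varepsilon/2$ then yields an $\varepsilon$-accurate solution of the original convex-concave problem.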

###### Corollary 2

For a regularized problem, we can write an estimate similar to (10):

$$\tilde{\mathcal{O}}\left(\frac{HLR^2}{K} + \frac{\|\bar z^0 - z^*\|^2}{K^2} + \frac{\sigma^{2/3}R^{4/3}}{M^{1/3}K^{1/3}} + \frac{H^{3/5}L^{2/5}(\sigma^{2/5} + D^{2/5})}{K^{2/5}}\right).$$

For the proof, we just need to take into account that the regularized problem has $\mu = \varepsilon/R^2$, and also the fact that an $\varepsilon/2$-accurate solution of the regularized problem is an $\varepsilon$-accurate solution of the original one.

It can be seen that if we take $H = \mathcal{O}(K^{1/9}/M^{4/9})$, we have a convergence rate of about $1/(M^{1/3}K^{1/3})$. The estimate for the number of communication rounds is then $\Omega(K^{8/9}M^{4/9})$.

## 3 Future work

This work is in progress. In the near future we want to add experiments on training GANs, as well as on solving the Wasserstein Barycenter problem with our new method. In particular, in the case of GANs, we want to test our method in the homogeneous and heterogeneous cases, and we also want to test the hypothesis that the generator and discriminator can be trained with different communication frequencies.

From the point of view of theory, we are concerned with the question of whether our analysis is tight and the obtained estimates unimprovable under Assumptions 1–4. Can we get other estimates? In particular, is it possible to obtain direct estimates for the convex-concave case without using regularization?

Also for future research, the question of studying the homogeneous case seems interesting (at the moment, we cover only the heterogeneous case). As a simple observation, one can notice that in the homogeneous case $D = 0$ in Assumption 4. But can the other estimates be improved?

## Appendix A General facts and technical lemmas

###### Lemma 1

For an arbitrary integer $m \ge 1$ and an arbitrary set of positive numbers $a_1, \dots, a_m$ we have

$$\left(\sum_{i=1}^m a_i\right)^2 \le m \sum_{i=1}^m a_i^2. \qquad (11)$$
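Lemma 1 is the Cauchy–Schwarz (power-mean) inequality applied to the all-ones vector; a quick numerical sanity check of ours:

```python
import random

random.seed(0)
m = 7
a = [random.uniform(0.1, 5.0) for _ in range(m)]  # arbitrary positive numbers
# (sum a_i)^2 <= m * sum a_i^2, with equality iff all a_i are equal
assert sum(a) ** 2 <= m * sum(x * x for x in a)
```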

## Appendix B Proof of Theorem 1

We start our proof with the following lemma:

###### Lemma 2

Let $z, y \in \mathbb{R}^n$ and let $\mathcal{Z} \subset \mathbb{R}^n$ be a convex compact set. We set $z^+ = \mathrm{proj}_{\mathcal{Z}}(z - y)$; then for all $u \in \mathcal{Z}$:

$$\|z^+ - u\|^2 \le \|z - u\|^2 - 2\langle y, z^+ - u\rangle - \|z^+ - z\|^2.$$

Proof: The property of the Euclidean projection $z^+ = \mathrm{proj}_{\mathcal{Z}}(z - y)$ gives $\langle z^+ - (z - y),\, z^+ - u\rangle \le 0$ for all $u \in \mathcal{Z}$. Then

$$\begin{aligned}
\|z^+ - u\|^2 &= \|z^+ - z + z - u\|^2 \\
&= \|z - u\|^2 + 2\langle z^+ - z,\, z - u\rangle + \|z^+ - z\|^2 \\
&= \|z - u\|^2 + 2\langle z^+ - z,\, z^+ - u\rangle - \|z^+ - z\|^2 \\
&= \|z - u\|^2 + 2\langle z^+ - (z - y),\, z^+ - u\rangle - 2\langle y,\, z^+ - u\rangle - \|z^+ - z\|^2 \\
&\le \|z - u\|^2 - 2\langle y,\, z^+ - u\rangle - \|z^+ - z\|^2.
\end{aligned}$$
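Lemma 2 can be checked numerically, taking $\mathcal{Z}$ to be the Euclidean unit ball (whose projection is $z \mapsto z / \max(1, \|z\|)$); this is an illustration of ours, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
proj = lambda z: z / max(1.0, float(np.linalg.norm(z)))  # projection onto the unit ball

z, y = rng.standard_normal(3), rng.standard_normal(3)
u = proj(rng.standard_normal(3))   # u must lie in Z
zp = proj(z - y)                   # z^+ = proj_Z(z - y)
lhs = np.linalg.norm(zp - u) ** 2
rhs = (np.linalg.norm(z - u) ** 2
       - 2 * float(np.dot(y, zp - u))
       - np.linalg.norm(zp - z) ** 2)
assert lhs <= rhs + 1e-9  # the inequality of Lemma 2
```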

Applying this Lemma with $z^+ = \bar z^{k+1}$, $z = \bar z^k$, $y = \gamma\bar g^{k+1/2}$ and $u = z^*$, we get

$$\|\bar z^{k+1} - z^*\|^2 \le \|\bar z^k - z^*\|^2 - 2\gamma\langle \bar g^{k+1/2},\, \bar z^{k+1} - z^*\rangle - \|\bar z^{k+1} - \bar z^k\|^2,$$

and with $z^+ = \bar z^{k+1/2}$, $z = \bar z^k$, $y = \gamma\bar g^k$, $u = \bar z^{k+1}$:

$$\|\bar z^{k+1/2} - \bar z^{k+1}\|^2 \le \|\bar z^k - \bar z^{k+1}\|^2 - 2\gamma\langle \bar g^k,\, \bar z^{k+1/2} - \bar z^{k+1}\rangle - \|\bar z^{k+1/2} - \bar z^k\|^2.$$

Next, we sum up the two previous inequalities:

$$\|\bar z^{k+1} - z^*\|^2 + \|\bar z^{k+1/2} - \bar z^{k+1}\|^2 \le \|\bar z^k - z^*\|^2 - \|\bar z^{k+1/2} - \bar z^k\|^2 - 2\gamma\langle \bar g^{k+1/2},\, \bar z^{k+1} - z^*\rangle - 2\gamma\langle \bar g^k,\, \bar z^{k+1/2} - \bar z^{k+1}\rangle.$$

A small rearrangement gives

$$\|\bar z^{k+1} - z^*\|^2 + \|\bar z^{k+1/2} - \bar z^{k+1}\|^2 \le \|\bar z^k - z^*\|^2 - \|\bar z^{k+1/2} - \bar z^k\|^2 - 2\gamma\langle \bar g^{k+1/2},\, \bar z^{k+1/2} - z^*\rangle + 2\gamma\langle \bar g^{k+1/2} - \bar g^k,\, \bar z^{k+1/2} - \bar z^{k+1}\rangle.$$

Using $2\langle a, b\rangle \le \|a\|^2 + \|b\|^2$ with $a = \gamma(\bar g^{k+1/2} - \bar g^k)$ and $b = \bar z^{k+1/2} - \bar z^{k+1}$, and cancelling $\|\bar z^{k+1/2} - \bar z^{k+1}\|^2$ on both sides, we have

$$\|\bar z^{k+1} - z^*\|^2 \le \|\bar z^k - z^*\|^2 - \|\bar z^{k+1/2} - \bar z^k\|^2 - 2\gamma\langle \bar g^{k+1/2},\, \bar z^{k+1/2} - z^*\rangle + \gamma^2\|\bar g^{k+1/2} - \bar g^k\|^2.$$

Then we take the total expectation of both sides:

$$\mathbb{E}\big[\|\bar z^{k+1} - z^*\|^2\big] \le \mathbb{E}\big[\|\bar z^k - z^*\|^2\big] - \mathbb{E}\big[\|\bar z^{k+1/2} - \bar z^k\|^2\big] - 2\gamma\mathbb{E}\big[\langle \bar g^{k+1/2},\, \bar z^{k+1/2} - z^*\rangle\big] + \gamma^2\mathbb{E}\big[\|\bar g^{k+1/2} - \bar g^k\|^2\big]. \qquad (12)$$

Further, we need to estimate the two terms $-2\gamma\mathbb{E}[\langle \bar g^{k+1/2}, \bar z^{k+1/2} - z^*\rangle]$ and $\gamma^2\mathbb{E}[\|\bar g^{k+1/2} - \bar g^k\|^2]$. For this we prove the following two lemmas, but first we introduce the additional notation

$$Err(k) = \frac{1}{M}\sum_{m=1}^M \|\bar z^k - z_m^k\|^2. \qquad (13)$$
###### Lemma 3

The following estimate is valid:

$$-2\gamma\mathbb{E}\big[\langle \bar g^{k+1/2},\, \bar z^{k+1/2} - z^*\rangle\big] \le -\gamma\mu\mathbb{E}\big[\|\bar z^{k+1/2} - z^*\|^2\big] + \frac{\gamma L^2}{\mu}\mathbb{E}\big[Err(k+1/2)\big]. \qquad (14)$$

Proof: We take into account the independence of all random vectors $\xi_m^{k+1/2}$ and single out the conditional expectation with respect to $\xi^{k+1/2}$:

$$\begin{aligned}
-2\gamma\mathbb{E}\big[\langle \bar g^{k+1/2},\, \bar z^{k+1/2} - z^*\rangle\big]
&= -2\gamma\mathbb{E}\left[\left\langle \frac{1}{M}\sum_{m=1}^M \mathbb{E}_{\xi^{k+1/2}}\big[F_m(z_m^{k+1/2}, \xi_m^{k+1/2})\big],\; \bar z^{k+1/2} - z^*\right\rangle\right] \\
&= -2\gamma\mathbb{E}\left[\left\langle \frac{1}{M}\sum_{m=1}^M F_m(z_m^{k+1/2}),\; \bar z^{k+1/2} - z^*\right\rangle\right] \\
&= -2\gamma\mathbb{E}\big[\langle F(\bar z^{k+1/2}),\, \bar z^{k+1/2} - z^*\rangle\big] - 2\gamma\mathbb{E}\left[\left\langle \frac{1}{M}\sum_{m=1}^M \big[F_m(z_m^{k+1/2}) - F_m(\bar z^{k+1/2})\big],\; \bar z^{k+1/2} - z^*\right\rangle\right].
\end{aligned}$$

Using the strong monotonicity of $F$ (Assumption 2) together with the optimality of $z^*$ (for which $\langle F(z^*), z - z^*\rangle \ge 0$ for all $z \in \mathcal{Z}$), we have

$$-2\gamma\mathbb{E}\big[\langle F(\bar z^{k+1/2}),\, \bar z^{k+1/2} - z^*\rangle\big] \le -2\gamma\mathbb{E}\big[\langle F(\bar z^{k+1/2}) - F(z^*),\, \bar z^{k+1/2} - z^*\rangle\big] \le -2\gamma\mu\mathbb{E}\big[\|\bar z^{k+1/2} - z^*\|^2\big].$$

For the remaining term, the inequality $2\langle a, b\rangle \le \frac{1}{\mu}\|a\|^2 + \mu\|b\|^2$ gives

$$\begin{aligned}
-2\gamma\mathbb{E}\big[\langle \bar g^{k+1/2},\, \bar z^{k+1/2} - z^*\rangle\big]
&\le -2\gamma\mu\mathbb{E}\big[\|\bar z^{k+1/2} - z^*\|^2\big] + \gamma\mu\mathbb{E}\big[\|\bar z^{k+1/2} - z^*\|^2\big] + \frac{\gamma}{\mu}\mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^M \big[F_m(\bar z^{k+1/2}) - F_m(z_m^{k+1/2})\big]\right\|^2\right] \\
&= -\gamma\mu\mathbb{E}\big[\|\bar z^{k+1/2} - z^*\|^2\big] + \frac{\gamma}{\mu M^2}\mathbb{E}\left[\left\|\sum_{m=1}^M \big[F_m(\bar z^{k+1/2}) - F_m(z_m^{k+1/2})\big]\right\|^2\right] \\
&\le -\gamma\mu\mathbb{E}\big[\|\bar z^{k+1/2} - z^*\|^2\big] + \frac{\gamma}{\mu M}\mathbb{E}\left[\sum_{m=1}^M \big\|F_m(\bar z^{k+1/2}) - F_m(z_m^{k+1/2})\big\|^2\right] \\
&\le -\gamma\mu\mathbb{E}\big[\|\bar z^{k+1/2} - z^*\|^2\big] + \frac{\gamma L^2}{\mu M}\mathbb{E}\left[\sum_{m=1}^M \big\|\bar z^{k+1/2} - z_m^{k+1/2}\big\|^2\right],
\end{aligned}$$

where the third line uses Lemma 1 and the last line Assumption 1.

Definition (13) ends the proof.

###### Lemma 4

The following estimate is valid:

$$\mathbb{E}\big[\|\bar g^{k+1/2} - \bar g^k\|^2\big] \le 5L^2\mathbb{E}\big[\|\bar z^{k+1/2} - \bar z^k\|^2\big] + \frac{10\sigma^2}{M} + 5L^2\mathbb{E}\big[Err(k+1/2)\big] + 5L^2\mathbb{E}\big[Err(k)\big]. \qquad (15)$$

Proof: We decompose the difference into five terms and apply Lemma 1 with $m = 5$:

$$\begin{aligned}
\mathbb{E}\big[\|\bar g^{k+1/2} - \bar g^k\|^2\big]
&= \mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^M F_m(z_m^{k+1/2}, \xi_m^{k+1/2}) - \frac{1}{M}\sum_{m=1}^M F_m(z_m^k, \xi_m^k)\right\|^2\right] \\
&\le 5\,\mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^M \big[F_m(z_m^{k+1/2}, \xi_m^{k+1/2}) - F_m(z_m^{k+1/2})\big]\right\|^2\right] + 5\,\mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^M \big[F_m(z_m^k, \xi_m^k) - F_m(z_m^k)\big]\right\|^2\right] \\
&\quad + 5\,\mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^M \big[F_m(z_m^{k+1/2}) - F_m(\bar z^{k+1/2})\big]\right\|^2\right] + 5\,\mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^M \big[F_m(z_m^k) - F_m(\bar z^k)\big]\right\|^2\right] \\
&\quad + 5\,\mathbb{E}\left[\left\|\frac{1}{M}\sum_{m=1}^M \big[F_m(\bar z^{k+1/2}) - F_m(\bar z^k)\big]\right\|^2\right].
\end{aligned}$$

By Lemma 1 and the Lipschitzness of $F_m$ (Assumption 1), the third and fourth terms are bounded by $5L^2\mathbb{E}[Err(k+1/2)]$ and $5L^2\mathbb{E}[Err(k)]$, respectively, and the fifth term by $5L^2\mathbb{E}\big[\|\bar z^{k+1/2} - \bar z^k\|^2\big]$. In the first two terms we take the conditional expectations over $\xi^{k+1/2}$ and $\xi^k$.

Using the independence of the machines and the variance bound (5), each of these two terms is at most $5\sigma^2/M$, and we get

$$\mathbb{E}\big[\|\bar g^{k+1/2} - \bar g^k\|^2\big] \le \frac{10\sigma^2}{M} + 5L^2\mathbb{E}\big[Err(k+1/2)\big] + 5L^2\mathbb{E}\big[Err(k)\big] + 5L^2\mathbb{E}\big[\|\bar z^{k+1/2} - \bar z^k\|^2\big].$$

We are now ready to combine (12), (14) and (15):

$$\begin{aligned}
\mathbb{E}\big[\|\bar z^{k+1} - z^*\|^2\big] &\le \mathbb{E}\big[\|\bar z^k - z^*\|^2\big] - \mathbb{E}\big[\|\bar z^{k+1/2} - \bar z^k\|^2\big] - \gamma\mu\mathbb{E}\big[\|\bar z^{k+1/2} - z^*\|^2\big] + \frac{\gamma L^2}{\mu}\mathbb{E}\big[Err(k+1/2)\big] \\
&\quad + 5\gamma^2 L^2\mathbb{E}\big[\|\bar z^{k+1/2} - \bar z^k\|^2\big] + \frac{10\gamma^2\sigma^2}{M} + 5\gamma^2 L^2\mathbb{E}\big[Err(k+1/2)\big] + 5\gamma^2 L^2\mathbb{E}\big[Err(k)\big].
\end{aligned}$$

Together with the condition on the step size $\gamma$, this transforms to

$$\mathbb{E}\big[\|\bar z^{k+1} - z^*\|^2\big] \le \left(1 - \frac{\mu\gamma}{2}\right)\mathbb{E}\big[\|\bar z^k - z^*\|^2\big] + 10\gamma^2\sigma^2\ldots$$