1 Introduction
Distributed Learning [20] is one of the main concepts in Machine and Deep Learning [16, 18], and the popularity of this approach continues to grow. The reason is the increase in the size of neural network models and the development of Big Data: modern devices cannot quickly process the volumes of information that industry requires of them. To speed up the learning process, one can use several devices; in this case each device solves its own subproblem, and their solutions are merged into a general, global one. The easiest way to do this is to divide the dataset into approximately equal parts and give each part to a different device. This approach works great when the devices are simply powerful, identical workers on the same cluster. But the modern world poses a challenge when the workers are users' personal devices. Now the data can be fundamentally different on each device (because it is defined by the users), while we still want to solve a global learning problem without forgetting about each client. Here we can mention Federated Learning [12, 8], a very young offshoot of Distributed Learning. As mentioned above, the main idea of Federated Learning is to train the global model with attention to each local client's submodels.

The bottleneck of Distributed Learning is communication: the communication process is much slower and more expensive than a local learning step. One can address this by reducing the cost of communication [2, 4]. Another approach is to decrease the number of communications. In this paper, we look at one of the most famous such concepts, Local SGD [24] (its classic centralized version). The idea of this method is rather simple: each device makes several local steps, training its own model; then the devices send the obtained model parameters to the server, which in turn averages these parameters and sends the result back to the devices, and the procedure repeats.
This approach can be considered in two cases [10]: homogeneous data, when each device has almost the same data, and heterogeneous, when the data on the devices differs significantly. It is shown that for homogeneous data the use of Local SGD is more preferable than in the heterogeneous case [23, 22]. But most of the analysis present in the literature is associated exclusively with the minimization problem $\min_{x \in \mathcal{X}} f(x) := \frac{1}{M}\sum_{m=1}^{M} f_m(x)$. The purpose of this work is to use Local SGD for saddle-point problems:

(1) $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y) := \frac{1}{M}\sum_{m=1}^{M} f_m(x,y).$
This problem is also popular in Machine Learning due to the great interest in the adversarial approach to network training. Now not one model is trained, but two, and the main goal of the second model is to deceive the first. The main variety of such neural models is the Generative Adversarial Network [5]. In fact, a GAN is nothing more than a classic saddle-point problem. Therefore, it is important to be able to solve the distributed problem (1), both in the context of neural networks and in other applications in science, for example, in economics: matrix games, Nash equilibria, etc. In particular, we note a fairly popular problem that can also be written in the form of a saddle-point problem, the Wasserstein Barycenter [1]. It has many practical applications; for example, it is now widely used in the analysis of brain images.

1.1 Related works
In this section we discuss the works on which our contribution is based.
Saddle-point problems. We highlight two main non-distributed algorithms. The first is Mirror Descent [3], which is customarily used in the non-smooth case. For a smooth problem, one uses a modification of Mirror Descent with an extra step [13], Mirror Prox [19, 7]. There are modifications of this algorithm that do the extra step but do not additionally call the oracle for it [6].
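The difference between plain descent-ascent and the extra-step idea can be illustrated numerically on the bilinear problem f(x, y) = x·y (our toy example, using the Euclidean unconstrained special case of Mirror Prox, i.e. the extragradient method of [13]):

```python
import numpy as np

# Operator for the bilinear saddle-point problem f(x, y) = x * y:
# F(z) = (df/dx, -df/dy) = (y, -x), a pure rotation around the saddle (0, 0).
def F(z):
    x, y = z
    return np.array([y, -x])

def gda_step(z, eta):            # plain (simultaneous) descent-ascent
    return z - eta * F(z)

def extragradient_step(z, eta):  # extra step: evaluate F at a look-ahead point
    z_half = z - eta * F(z)
    return z - eta * F(z_half)

z_gda = z_eg = np.array([1.0, 1.0])
for _ in range(100):
    z_gda = gda_step(z_gda, 0.5)
    z_eg = extragradient_step(z_eg, 0.5)

print(np.linalg.norm(z_gda) > 1.0)   # True: GDA spirals away from the saddle
print(np.linalg.norm(z_eg) < 1e-3)   # True: the extragradient method converges
```

On this problem the descent-ascent operator is a rotation, so plain updates expand the distance to the saddle at every step, while the look-ahead evaluation of the extragradient method produces a contraction.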
Local SGD. The idea of distributed optimization using methods similar to Local SGD is not an achievement of recent years and goes back to [17, 24]. The development of theoretical analysis for these methods can be traced in the following works [14, 21, 10, 23, 22, 11]. One can also note the use of additional techniques aimed at variance reduction [9].

Local Fixed-Point Method. Solving saddle-point problems with Local SGD is not popular in the literature. The work [15] is of interest: the authors give a generalization of Local SGD with an arbitrary operator for which we are looking for a fixed point. In particular, one can consider an operator of the form $T(z) = z - \gamma G(z)$, where $\gamma > 0$. If we substitute the gradient of a function for $G$, we get exactly the operator of gradient descent. But one can also use the saddle-point operator and obtain an operator for descent-ascent. The main drawback of this work is the rather strong assumption regarding the operator $T$: the operator is supposed to be firmly non-expansive. Such a condition is satisfied for the gradient operator of a convex function, but fails for the descent-ascent operator already for the simplest and most classical saddle-point problem (for more details see Appendix C). In this paper, we want to provide an analysis without such a strong assumption.
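The failure of firm non-expansiveness can be checked numerically. Below we use the bilinear problem f(x, y) = x·y as an assumed instance of the "simplest classical" saddle-point problem; the operator names and the step size are ours:

```python
import numpy as np

# Descent-ascent operator T(z) = z - gamma * F(z) for the bilinear
# problem f(x, y) = x * y, where F(z) = (y, -x).
def T(z, gamma=0.1):
    x, y = z
    return z - gamma * np.array([y, -x])

# Firm non-expansiveness requires, for all z1, z2:
#   ||T(z1) - T(z2)||^2 <= <T(z1) - T(z2), z1 - z2>.
z1, z2 = np.array([1.0, 0.0]), np.array([0.0, 0.0])
lhs = np.linalg.norm(T(z1) - T(z2)) ** 2
rhs = np.dot(T(z1) - T(z2), z1 - z2)
print(lhs > rhs)  # True: the condition fails for this saddle-point operator
```

Here the inner product ⟨F(z), z⟩ is zero for the rotation operator, so ‖T(z)‖² = (1 + γ²)‖z‖² strictly exceeds ⟨T(z), z⟩ = ‖z‖², and the firm non-expansiveness inequality cannot hold.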
1.2 Our contributions
In this paper, we present an extra step modification [13] of the Local SGD algorithm for stochastic smooth strongly-convex-strongly-concave saddle-point problems. We prove a theoretical bound on the number of communication rounds as a function of the number of local iterations. Additionally, using the regularization technique, we transfer the results from the strongly-convex-strongly-concave case to the convex-concave case and obtain a corresponding bound.
2 Main part
2.1 Settings and assumptions
As mentioned above, we consider problem (1), where the sets $\mathcal{X}$ and $\mathcal{Y}$ are convex compact sets. For simplicity, we introduce the set $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ with $z = (x, y)$, and the operators $F_m$:

(2) $F_m(z) = F_m(x, y) = \left(\nabla_x f_m(x, y),\; -\nabla_y f_m(x, y)\right).$
We do not have access to the oracles for $F_m(z)$; at each iteration our oracle gives only some stochastic realisation $F_m(z, \xi)$. Additionally, we introduce the following assumptions:
Assumption 1. Each $F_m(z)$ is Lipschitz continuous with constant $L$, i.e. for all $z_1, z_2 \in \mathcal{Z}$

(3) $\|F_m(z_1) - F_m(z_2)\| \le L \|z_1 - z_2\|.$
Assumption 2. $f(x, y)$ is $\mu$-strongly-convex-strongly-concave. One can rewrite it in the following form: for all $z_1, z_2 \in \mathcal{Z}$

(4) $\langle F(z_1) - F(z_2),\; z_1 - z_2 \rangle \ge \mu \|z_1 - z_2\|^2.$
Assumption 3. $F_m(z, \xi)$ is unbiased and has bounded variance, i.e.

(5) $\mathbb{E}\left[F_m(z, \xi)\right] = F_m(z), \qquad \mathbb{E}\|F_m(z, \xi) - F_m(z)\|^2 \le \sigma^2.$
Assumption 4. The values of the local operators are sufficiently close to the value of the mean operator, i.e. for all $z \in \mathcal{Z}$

(6) $\|F_m(z) - F(z)\|^2 \le D^2,$

where $F(z) = \frac{1}{M}\sum_{m=1}^{M} F_m(z)$.
Hereinafter, we use the standard Euclidean norm $\|\cdot\|$. We also introduce the notation $\mathrm{proj}_{\mathcal{Z}}(z)$ for the Euclidean projection onto the set $\mathcal{Z}$.
2.2 New algorithm and its convergence
Our algorithm is standard Local SGD with a slight modification: we add an extra step (see Algorithm 1). The extra step method is a standard approach to solving smooth saddle-point problems. First of all, this is because it allows for an optimal theoretical analysis; in practice, this method also gives some minor convergence improvements.
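As a rough illustration (not the authors' exact Algorithm 1: we assume simultaneous communication in x and y, a Euclidean ball in place of the set Z, and additive Gaussian oracle noise), the scheme can be sketched as:

```python
import numpy as np

def proj_ball(z, r=10.0):
    """Euclidean projection onto a ball of radius r (a stand-in for proj onto Z)."""
    n = np.linalg.norm(z)
    return z if n <= r else z * (r / n)

def local_extra_step_sgd(oracles, z0, gamma=0.1, comm_every=5, iters=200,
                         noise=0.01, seed=0):
    """Sketch of Local SGD with an extra step: each device performs local
    extragradient-type updates with a stochastic oracle; every `comm_every`
    iterations the server averages the local iterates and broadcasts them."""
    rng = np.random.default_rng(seed)
    M = len(oracles)
    zs = [z0.copy() for _ in range(M)]
    for t in range(iters):
        for m in range(M):
            g = oracles[m](zs[m]) + noise * rng.normal(size=z0.shape)
            z_half = proj_ball(zs[m] - gamma * g)        # extra (look-ahead) step
            g_half = oracles[m](z_half) + noise * rng.normal(size=z0.shape)
            zs[m] = proj_ball(zs[m] - gamma * g_half)    # main step
        if (t + 1) % comm_every == 0:                    # communication round
            avg = sum(zs) / M
            zs = [avg.copy() for _ in range(M)]
    return sum(zs) / M

# Toy strongly monotone linear operators F_m(z) = A z + b_m; the saddle point
# of the averaged problem satisfies A z* + mean(b_m) = 0.
A = np.array([[1.0, 1.0], [-1.0, 1.0]])
bs = [np.array([1.0, 0.0]), np.array([-1.0, 2.0])]
oracles = [lambda z, b=b: A @ z + b for b in bs]
z_star = -np.linalg.solve(A, sum(bs) / 2)
z_hat = local_extra_step_sgd(oracles, np.zeros(2))
print(np.linalg.norm(z_hat - z_star) < 0.1)  # close to the saddle point
```

The two oracles here are heterogeneous (different $b_m$), so between communications each device drifts toward its own local solution; the periodic averaging pulls the iterates back toward the saddle point of the mean problem.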
Also, our algorithm provides the ability to set different communication moments for the variables $x$ and $y$. One can note that this approach works well for Federated Learning of GANs: we can vary the communication frequencies of the generator and the discriminator if one of the models is the same on all devices while the second requires more frequent averaging.

We now present a theoretical analysis of the proposed method. To begin with, we introduce auxiliary sequences that we need only in the theoretical analysis (Algorithm 1 does not compute them):

(7) $\bar{x}^t = \frac{1}{M}\sum_{m=1}^{M} x_m^t, \qquad \bar{y}^t = \frac{1}{M}\sum_{m=1}^{M} y_m^t.$

Such sequences are really virtual, but one can see that at a communication moment $\bar{x}^t = x_m^t$ or $\bar{y}^t = y_m^t$ for all $m$. At the last iteration, it is assumed that the algorithm communicates in both $x$ and $y$; this means the algorithm's answer is equal to $(\bar{x}^T, \bar{y}^T)$. Therefore, we provide a theoretical analysis using these sequences.
Theorem 1
For the proof of the theorem, see Appendix B. It is also possible to prove the following convergence corollary:
Corollary 1
Let , and , then from (8) we get:
(9)  
The proof of this fact is quite obvious and is a simple substitution into (8). The estimate in (9) can be rewritten without polylogarithmic and constant numerical factors as follows
(10) 
It can be seen that if we take , we have a convergence rate of about . The estimate for the number of communication rounds is .
As noted earlier, the bottleneck of distributed optimization is the time and cost of communications, so their number (not the total number of iterations) is the main issue. The obtained estimate for the number of communication rounds shows that if we made local iterations, then we communicated only times.
Regularization and the convex-concave case. The above estimates can be extended to the case when Assumption 2 is satisfied with $\mu = 0$, i.e. to a convex-concave saddle-point problem. For this, the original function is regularized, where $\varepsilon$ is the accuracy of the solution for $f$, and $\Omega$ is the optimization set diameter. In this case, the regularized problem is solved with an accuracy proportional to $\varepsilon$. Then the following estimate is valid:
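The accuracy bookkeeping behind this regularization trick can be sketched as follows; the specific regularizer and its constants are a standard choice we assume here, not necessarily the paper's exact formulas:

```latex
% Assumed regularizer for a convex-concave f, accuracy eps, diameter Omega:
\[
f_{\mathrm{reg}}(x, y)
  = f(x, y)
  + \frac{\varepsilon}{4\Omega^2}\,\|x - x^0\|^2
  - \frac{\varepsilon}{4\Omega^2}\,\|y - y^0\|^2 .
\]
% f_reg is strongly-convex-strongly-concave with mu = eps / (2 Omega^2).
% On the feasible set, ||x - x^0||, ||y - y^0|| <= Omega, hence
% |f_reg(x, y) - f(x, y)| <= eps / 4, so a point with saddle-point gap eps / 2
% for f_reg has gap at most eps / 2 + 2 * (eps / 4) = eps for the original f.
```

This is the usual way a strongly-convex-strongly-concave rate is transferred to the merely convex-concave setting: the regularization strength is tied to the target accuracy, and the extra error it introduces is absorbed into the accuracy budget.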
Corollary 2
For a regularized problem, we can write an estimate similar to (10):
For the proof, we just need to take into account the strong-convexity-strong-concavity parameter induced by the regularizer and the relation between the accuracies of the regularized and original problems.
It can be seen that if we take , we have a convergence rate of about . The estimate for the number of communication rounds is .
3 Future work
This work is in progress. In the near future we want to add experiments on training GANs, as well as on solving the Wasserstein Barycenter problem with our new method. In particular, in the case of GANs, we want to test our method in the homogeneous and heterogeneous cases, and we also want to test the hypothesis that the generator and the discriminator can be trained with different communication frequencies.
From the point of view of theory, we are concerned with the question of whether our proof is optimal and whether the obtained estimates are unimprovable under Assumptions 1-4. Can we get some other estimates? In particular, is it possible to obtain direct estimates for a convex-concave saddle-point problem without using regularization?
Also for future research, the question of studying the homogeneous case seems interesting (at the moment, we have only the heterogeneous case). As a simple observation, one can notice that in the homogeneous case $D = 0$ in Assumption 4. But can other estimates be changed?
References
[1] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
[3] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. 2019.
[4] Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
[5] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[6] Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the convergence of single-call stochastic extra-gradient methods, 2019.
[7] Anatoli Juditsky, Arkadii S. Nemirovskii, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm, 2008.
[8] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[9] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. arXiv preprint arXiv:1910.06378, 2019.

[10] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529, 2020.
[11] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U Stich. A unified theory of decentralized SGD with changing topology and local updates. arXiv preprint arXiv:2003.10422, 2020.
[12] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
[13] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. 1976.
[14] Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.
[15] Grigory Malinovsky, Dmitry Kovalev, Elnur Gasanov, Laurent Condat, and Peter Richtárik. From local SGD to local fixed point methods for federated learning. arXiv preprint arXiv:2004.01442, 2020.

[16] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 456–464, 2010.
[17] Ryan McDonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S. Mann. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems 22, pages 1231–1239. Curran Associates, Inc., 2009.
[18] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[19] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15:229–251, 2004.
[20] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[21] Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
[22] Blake Woodworth, Kumar Kshitij Patel, and Nathan Srebro. Minibatch vs local SGD for heterogeneous distributed learning. arXiv preprint arXiv:2006.04735, 2020.
[23] Blake Woodworth, Kumar Kshitij Patel, Sebastian U Stich, Zhen Dai, Brian Bullins, H Brendan McMahan, Ohad Shamir, and Nathan Srebro. Is local SGD better than minibatch SGD? arXiv preprint arXiv:2002.07839, 2020.

[24] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010.
Appendix A General facts and technical lemmas
Lemma 1
For an arbitrary integer and an arbitrary set of positive numbers, we have
(11) 
Appendix B Proof of Theorem 1
We start our proof with the following lemma:
Lemma 2
Let $z, u \in \mathbb{R}^n$ and let $\mathcal{Z} \subset \mathbb{R}^n$ be a convex compact set. We set $z^+ = \mathrm{proj}_{\mathcal{Z}}(z - u)$; then for all $v \in \mathcal{Z}$:
Proof: The fact $z^+ = \mathrm{proj}_{\mathcal{Z}}(z - u)$ gives $\langle z^+ - z + u,\; v - z^+ \rangle \ge 0$ for all $v \in \mathcal{Z}$. Then
Applying this Lemma with , , and , we get
and with , , , :
Next, we sum up the two previous equalities
A small rearrangement gives
Adding to both sides of equality and using , we have
Then we take the total expectation of both sides of the equation
(12)  
Further, we need to additionally estimate two terms and . For this we prove the following two lemmas, but before that we introduce the additional notation:
(13) 
Lemma 3
The following estimate is valid:
(14) 
Proof:
We take into account the independence of all random vectors and select only the conditional expectation with respect to this vector. Using the unbiasedness property, we have:
For it is true that , then
Definition (13) ends the proof.
Lemma 4
The following estimate is valid: