1 Introduction
Federated learning is widely recognized as a framework that can protect data privacy (Konečný et al., 2016; Smith et al., 2017b; Yang et al., 2019). State-of-the-art federated learning adopts a centralized network architecture in which a central node collects the gradients sent from child agents to update the global model. Despite its simplicity, the centralized method suffers from communication and computational bottlenecks at the central node, especially in federated learning, where a large number of clients are usually involved. Moreover, to prevent reverse engineering of the user’s identity, a certain amount of noise must be added to the gradient to protect user privacy, which partially sacrifices efficiency and accuracy (Shokri and Shmatikov, 2015).
To further protect data privacy and avoid the communication bottleneck, a decentralized architecture has recently been proposed (Vanhaesebrouck et al., 2017; Bellet et al., 2018), in which the central node is removed and each node communicates only with its neighbors (with mutual trust) by exchanging local models. With respect to data privacy, exchanging local models is usually favored over sending private gradients, because a local model is an aggregation or mixture of a large amount of data, while a local gradient directly reflects only one or a batch of private data samples. Although the advantages of the decentralized architecture over the state-of-the-art method (its centralized counterpart) are well recognized, it can usually be run only on networks with mutual trust. That is, two nodes (or users) can exchange their local models only if they trust each other reciprocally (e.g., node A may trust node B, but if node B does not trust node A, they cannot communicate). Given a social network, one can use only the edges with mutual trust to run decentralized federated learning algorithms. Two immediate drawbacks follow:

If the mutual-trust edges do not form a connected network, federated learning does not apply;

Removing all single-sided edges from the communication network could significantly reduce the efficiency of communication.
This leads to the question: how do we effectively utilize the single-sided trust edges under the decentralized federated learning framework?
In this paper, we consider the social network scenario, where a centralized network is unavailable (e.g., there is no central node that can build a connection with all users, or the centralized communication cost is not affordable). We make minimal assumptions about the social network:

The data may come in a streaming fashion on each user node as the federated learning algorithm runs;

The trust between users may be single-sided, where user A trusts user B, but user B may not trust user A (“trust” means “would like to send information to”).
For the aforementioned setting, we develop a decentralized learning algorithm called online push-sum (OPS), which possesses the following features:

Only models, rather than local gradients, are exchanged among clients in our algorithm. This can reduce the risk of exposing clients’ private data (Aono et al., 2017).

Our algorithm removes some constraints imposed by typical decentralized methods, which makes it flexible enough to allow arbitrary network topologies. Each node only needs to know its out-neighbors instead of the global topology.

We provide a rigorous regret analysis for the proposed algorithm and specifically distinguish two components in the online loss function: the adversarial component and the stochastic component, which can model clients’ private data and internal connections between clients, respectively.
Notation
We adopt the following notation in this paper:

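For a random variable $\xi_{i,t}$ subject to distribution $D_{i,t}$, we use $\Xi_{n,T}$ and $\mathcal{D}_{n,T}$ to denote the sets of random variables and distributions, respectively: $\Xi_{n,T} = \{\xi_{i,t}\}_{1 \le i \le n,\, 1 \le t \le T}$ and $\mathcal{D}_{n,T} = \{D_{i,t}\}_{1 \le i \le n,\, 1 \le t \le T}$. The notation $\Xi_{n,T} \sim \mathcal{D}_{n,T}$ implies $\xi_{i,t} \sim D_{i,t}$ for any $i \in [n]$ and $t \in [T]$.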

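For a decentralized network with $n$ nodes, we use $W \in \mathbb{R}^{n \times n}$ to denote the confusion matrix, where $W_{ij} \ge 0$ is the weight that node $i$ sends to node $j$ ($i, j \in [n]$). $N_i^{\text{out}} = \{j \in [n] : W_{ij} > 0\}$ and $N_i^{\text{in}} = \{k \in [n] : W_{ki} > 0\}$ are also used to denote the sets of out-neighbors and in-neighbors of node $i$, respectively.
Norm $\|\cdot\|$ denotes the $\ell_2$ norm $\|\cdot\|_2$ by default.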
2 Related Work
The concept of federated learning was first proposed in McMahan et al. (2016), which advocates a novel learning setting that learns a shared model by aggregating locally computed gradient updates without centralizing the data distributed across devices. Early examples of research into federated learning also include Konečný et al. (2015, 2016) and a widely read blog article posted by Google AI (McMahan and Ramage, 2017). To address both statistical and system challenges, Smith et al. (2017a) and Caldas et al. (2018) propose a multi-task learning framework for federated learning and its related optimization algorithm, which extends the earlier works SDCA (Shalev-Shwartz and Zhang, 2013; Yang, 2013; Yang et al., 2013) and COCOA (Jaggi et al., 2014; Ma et al., 2015; Smith et al., 2016) to the federated learning setting. Among these optimization methods, Federated Averaging (FedAvg), proposed by McMahan et al. (2016), outperforms conventional synchronized mini-batch SGD in terms of communication rounds and also converges on non-IID and unbalanced data. Recent rigorous theoretical analysis (Stich, 2018; Wang and Joshi, 2018; Yu et al., 2018; Lin et al., 2018) shows that FedAvg is a special case of periodic averaging SGD (also called “local SGD”), which allows nodes to perform local updates and synchronize infrequently, so that they communicate less while still converging quickly. All these previous works in federated learning consider system constraints on privacy or on communication and computation cost. However, they cannot be applied to a single-sided trust network (asymmetric topology matrix).
Decentralized learning is a typical parallel strategy in which each worker is only required to communicate with its neighbors, which removes the communication bottleneck at the parameter server. It has been shown that decentralized learning can outperform traditional centralized learning when the number of workers is comparatively large and the network condition is poor (Lian et al., 2017).
There are two main types of decentralized learning algorithms: those with a fixed network topology (He et al., 2018) and those whose topology varies over time during training (Nedić and Olshevsky, 2015; Lian et al., 2018). Wu et al. (2017) and Shen et al. (2018) show that decentralized SGD converges at a rate comparable to that of the centralized algorithm with less communication, which makes large-scale model training feasible. Li et al. (2018) provide a systematic analysis of the decentralized learning pipeline.
Online learning has been studied for decades. It is well known that the lower bounds on the regret of online optimization methods are $O(\sqrt{T})$ and $O(\log T)$ for convex and strongly convex loss functions, respectively (Hazan, 2016; Shalev-Shwartz, 2012). In recent years, due to the increasing volume of data, distributed online learning, especially decentralized methods, has attracted much attention. Examples of these works include Kamp et al. (2014), Shahrampour and Jadbabaie (2017), and Lee et al. (2016). Notably, Zhao et al. (2019) share a similar problem definition and theoretical result with our paper. However, single-sided communication is not allowed in their setting, making their results more restrictive.
3 Problem Setting
In this paper, we consider federated learning with $n$ clients (a.k.a. nodes). Each client can be either an edge server or some other kind of computing device such as a smartphone, which stores local private data and a local machine learning model. We assume the topological structure of the network of these nodes can be represented by a directed graph $G = (V, E)$ with vertex set $V$ and edge set $E$. If there exists an edge $(u, v) \in E$, it means that node $u$ and node $v$ have a network connection and $u$ can directly send messages to $v$.
Let $x_{i,t}$ denote the local model on the $i$-th node at iteration $t$. In each iteration, node $i$ receives a new sample and computes a prediction for this new sample according to the current model $x_{i,t}$ (e.g., it may recommend some items to the user in an online recommendation system). After that, a loss function $f_{i,t}(\cdot)$ associated with that new sample is received by node $i$. The typical goal of online learning is to minimize the regret, which is defined as the difference between the sum of the losses incurred by the nodes’ predictions and the corresponding loss of the globally optimal model $x^*$:
$$\tilde{R}_T := \sum_{t=1}^{T} \sum_{i=1}^{n} \left( f_{i,t}(x_{i,t}) - f_{i,t}(x^*) \right),$$
where $x^* = \arg\min_{x} \sum_{t=1}^{T} \sum_{i=1}^{n} f_{i,t}(x)$.
However, here we consider a more general online setting: the loss function of the $i$-th node at iteration $t$ is $f_{i,t}(\cdot\,; \xi_{i,t})$, which is additionally parametrized by a random variable $\xi_{i,t}$. This $\xi_{i,t}$ is drawn from the distribution $D_{i,t}$ and is mutually independent across $i$ and $t$; we call this part the stochastic component of the loss function $f_{i,t}(\cdot\,; \xi_{i,t})$. The stochastic component can be used to characterize the internal randomness of the nodes’ data and the potential connections among different nodes. For example, music preference may be influenced by popular trends on the Internet, which our model can capture by letting $D_{i,t} = D_t$ for all $i \in [n]$ with some time-varying distribution $D_t$. On the other hand, the function $f_{i,t}(\cdot\,; \cdot)$ is the adversarial component of the loss, which may include, for example, the user’s profile, location, etc. Therefore, the objective regret naturally becomes the expectation of all the past losses:
$$R_T := \mathbb{E} \left\{ \sum_{t=1}^{T} \sum_{i=1}^{n} \left( f_{i,t}(x_{i,t}; \xi_{i,t}) - f_{i,t}(x^*; \xi_{i,t}) \right) \right\},$$
with $x^* = \arg\min_{x} \mathbb{E} \sum_{t=1}^{T} \sum_{i=1}^{n} f_{i,t}(x; \xi_{i,t})$, where the expectation is taken over all $\xi_{i,t}$.
One benefit of the above formulation is that it partially solves the non-I.I.D. issue in federated learning. A fundamental assumption in many traditional distributed machine learning methods is that the data samples stored on all nodes are I.I.D., which fails to hold for federated learning since the data on each user’s device is highly correlated with the user’s preferences and habits. However, our formulation does not require the I.I.D. assumption to hold for the adversarial component at all. Even though the random samples for the stochastic component still need to be independent, they are allowed to be drawn from different distributions.
Finally, one should note that online optimization also includes stochastic optimization (i.e., data samples are drawn from a fixed distribution) and offline optimization (i.e., data are already collected before optimization begins) as special cases (Shalev-Shwartz, 2012). Hence, our problem setting actually covers a wide range of applications.
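To make the regret definition in this section concrete, the following minimal Python sketch computes the static regret of a toy one-dimensional problem. The squared loss, the function name `static_regret`, and the toy data are illustrative assumptions rather than part of the paper's setting; the squared loss is chosen only because its best fixed model in hindsight is simply the mean of all targets.

```python
def static_regret(predictions, targets):
    """Static regret for the toy loss f_{i,t}(x) = 0.5 * (x - y_{i,t})**2.

    predictions[t][i] is the scalar model of node i at iteration t and
    targets[t][i] is the sample that node receives. The loss and the data
    layout are illustrative assumptions.
    """
    flat_targets = [y for row in targets for y in row]
    # For this squared loss, the best fixed model in hindsight is the mean.
    x_star = sum(flat_targets) / len(flat_targets)

    # Total loss actually incurred by all nodes over all iterations.
    incurred = sum(
        0.5 * (x - y) ** 2
        for pred_row, tgt_row in zip(predictions, targets)
        for x, y in zip(pred_row, tgt_row)
    )
    # Total loss of the single best fixed model on the same samples.
    best_fixed = sum(0.5 * (x_star - y) ** 2 for y in flat_targets)
    return incurred - best_fixed, x_star


# Two nodes, three iterations; both nodes start far from the optimum.
targets = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
predictions = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
regret, x_star = static_regret(predictions, targets)
```

In this toy run the nodes incur a total loss of 7 while the best fixed model (`x_star = 2.0`) incurs 2, so the regret is 5. The expected regret of the general setting would additionally average this quantity over the random draws of the stochastic component.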
4 Online Push-Sum Algorithm
In this section, we describe the construction of the confusion matrix and introduce the proposed algorithm.
4.1 Construction of Confusion Matrix
One important parameter of the algorithm is the confusion matrix $W$. $W$ is a matrix that depends on the network topology $G$, which means $W_{ij} = 0$ if there is no directed edge $(i, j)$ in $G$. If the value of $W_{ij}$ is large, node $i$ will have a stronger impact on node $j$. However, $W$ still allows flexibility: users can specify the weights associated with existing edges, so even if there is a physical connection between two nodes, the nodes can decide against using the channel. For example, even if the edge $(i, j)$ exists, user $i$ can still set $W_{ij} = 0$ if user $i$ thinks node $j$ is not trustworthy and therefore chooses to exclude the channel from $i$ to $j$.
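As a minimal sketch of how such a matrix might be assembled, the Python snippet below builds a confusion matrix from a list of directed edges, lets a node exclude channels it does not trust, and normalizes each node's out-weights so that every row sums to 1 (the row-stochastic constraint discussed next). The function name `build_confusion_matrix`, the `untrusted` parameter, the self-loops, and the uniform weights are all assumptions made for illustration; the paper does not prescribe a specific weighting.

```python
def build_confusion_matrix(n, edges, untrusted=()):
    """Return an n x n row-stochastic matrix W as a list of lists.

    edges     -- iterable of directed pairs (i, j): node i can send to node j
    untrusted -- pairs (i, j) that node i chooses not to use even though the
                 physical link exists (hypothetical parameter)
    W[i][j] is the weight node i sends to node j; it stays 0 when the edge is
    absent or excluded. Keeping a self-loop and splitting the weight
    uniformly is one simple choice among many.
    """
    blocked = set(untrusted)
    out = [{i} for i in range(n)]  # self-loop: each node keeps part of its own model
    for i, j in edges:
        if (i, j) not in blocked:
            out[i].add(j)

    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        share = 1.0 / len(out[i])  # node i only looks at its own out-neighbors
        for j in out[i]:
            W[i][j] = share
    return W


edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
# Node 0 has a link to node 2 but decides not to use it.
W = build_confusion_matrix(3, edges, untrusted=[(0, 2)])
```

Each row is normalized using only that node's own out-neighbors, so no node needs to know the global topology. The columns are not required to sum to 1.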
Of course, there are still some constraints on $W$. $W$ must be a row stochastic matrix (i.e., each entry in $W$ is nonnegative and each row sums to 1). This assumption differs from the one in classical decentralized distributed optimization, which typically assumes that $W$ is symmetric and doubly stochastic (e.g., Duchi et al. 2011) (i.e., every row and every column sums to 1). Such a requirement is quite restrictive, because not all networks admit a doubly stochastic matrix (Gharesifard and Cortés 2010), and relinquishing double stochasticity can introduce bias in optimization (Ram et al., 2010; Tsianos and Rabbat, 2012). In comparison, our assumption that $W$ is row stochastic avoids such concerns, since any nonnegative matrix with at least one positive entry in each row (which is already implied by the connectivity of the graph) can easily be normalized to be row stochastic. The relaxation of this assumption is crucial for federated learning, considering that a federated learning system usually involves a complex network topology due to its large number of clients. Moreover, since each node only needs to make sure that its out-weights sum to 1, there is no need for it to be aware of the global network topology, which greatly benefits the implementation of a federated system. On the other hand, requiring $W$ to be symmetric rules out the possibility of using an asymmetric network topology and adopting single-sided trust, while our method has no such restriction.
4.2 Algorithm Description
The proposed online push-sum algorithm is presented in Algorithm 1. The algorithm design mainly follows the pattern of the push-sum algorithm (Tsianos et al., 2012), but here we further generalize it to the online setting.
The algorithm mainly consists of three steps:

Local update: each client applies the current local model to obtain the loss function, based on which an intermediate local model is computed;

Push: each client sends its weighted intermediate variable to all of its out-neighbors;

Sum: each client sums all the variables it receives and normalizes the result to obtain its new local model.
It should be noted that auxiliary variables are used in the algorithm. They clarify the description but can easily be removed in a practical implementation. In addition, a normalizing factor is introduced for each node. This factor plays an important role in the push-sum algorithm: since $W$ is not doubly stochastic in our setting, the total weight a node receives may not equal 1. The normalizing factor helps the algorithm avoid the issues caused by a $W$ that is not doubly stochastic. Furthermore, when $W$ becomes doubly stochastic, it can easily be verified that the normalizing factor equals 1 and the auxiliary variable coincides with the local model at every node and iteration, so Algorithm 1 reduces to the distributed online gradient method proposed by Zhao et al. (2019).
In the algorithm, the local data, which is encoded in the gradient (Shokri and Shmatikov, 2015), is used only to update the local model. Neighboring nodes exchange only their local models. Exchanging models instead of local data reduces the risk of leaking users’ privacy.
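The three steps described above can be sketched in a few lines of Python. This is a hedged illustration that assumes the standard push-sum form (each node keeps an un-normalized auxiliary vector and a scalar normalizing factor, and its local model is their ratio); Algorithm 1 itself is not reproduced here, and the names `ops_round`, `z`, `omega`, and `models` are chosen for this sketch only.

```python
def ops_round(z, omega, W, grads, step_size):
    """One synchronous round of a push-sum style online gradient update.

    z[i]      -- un-normalized auxiliary vector held by node i
    omega[i]  -- scalar normalizing factor of node i
    W[i][j]   -- weight node i sends to node j (each row sums to 1)
    grads[i]  -- gradient of node i's current loss, evaluated by the caller
                 at node i's local model z[i] / omega[i]
    """
    n, dim = len(z), len(z[0])
    # Local update: gradient step on the auxiliary variable.
    half = [[z[i][k] - step_size * grads[i][k] for k in range(dim)]
            for i in range(n)]

    new_z = [[0.0] * dim for _ in range(n)]
    new_omega = [0.0] * n
    for i in range(n):
        for j in range(n):
            if W[i][j] > 0.0:
                # Push: node i sends W[i][j]-weighted copies to out-neighbor j.
                # Sum: node j accumulates everything it receives.
                for k in range(dim):
                    new_z[j][k] += W[i][j] * half[i][k]
                new_omega[j] += W[i][j] * omega[i]

    # Normalize to obtain the new local models.
    models = [[v / new_omega[j] for v in new_z[j]] for j in range(n)]
    return new_z, new_omega, models


# Row-stochastic but not doubly stochastic weights on three nodes.
W = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.7, 0.0, 0.3]]
z = [[3.0], [0.0], [-6.0]]
omega = [1.0, 1.0, 1.0]
zero_grads = [[0.0], [0.0], [0.0]]
for _ in range(100):
    z, omega, models = ops_round(z, omega, W, zero_grads, step_size=0.1)
```

With zero gradients the round reduces to plain push-sum averaging: every local model approaches the average of the initial vectors (here -1.0) even though the normalizing factors drift away from 1, which illustrates how the normalizing factor corrects the bias of a matrix that is not doubly stochastic. In an actual online run, `grads` would be recomputed at each node's current local model in every round.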
4.3 Regret Analysis
In this subsection, we provide a regret bound analysis of the OPS algorithm. Due to space limitations, the detailed proof is deferred to the supplementary material. For convenience, we first denote
To carry out the analysis, the following assumptions are required:
Assumption 1.
We make the following assumptions throughout this paper:

The topological graph is strongly connected.

The confusion matrix $W$ is row stochastic.

For any and , the loss function is convex in .

The norm of the expected gradient is bounded, i.e., there exist constant such that
for any , and .
Here provides an upper bound for the adversarial component. On the other hand, measures the magnitude of stochasticity brought by the stochastic component. When , the problem setting simply reduces to normal distributed online learning. As for the convexity and the domain boundedness assumptions, they are quite common in the online learning literature, such as (Hazan, 2016).
Equipped with these assumptions, we are now ready to present our main theorem:
Theorem 2.
For the online push-sum algorithm with step size , it holds that
(1) 
where
and , and are some constants defined in the appendix.
By choosing an optimal step size , we can obtain the following corollary:
Corollary 3.
If we set
the regret of OPS can be bounded by:
(2) 
Note that when and , where the problem setting just reduces to normal online optimization, the implied regret bound exactly matches the lower bound of online optimization (Hazan, 2016). Moreover, our result also matches the convergence rate of centralized online learning where for a fully connected network. Hence, we can conclude that the OPS algorithm has optimal dependence on $T$.
This bound depends linearly on the number of nodes, which is easy to understand. First, we have defined the regret as the sum of the losses on all the nodes, so increasing the number of nodes naturally makes the regret larger. Second, our federated learning setting differs from typical distributed learning in that the I.I.D. assumption does not hold here. Each node contains distinct local data, which may be drawn from totally different distributions. Therefore, adding more nodes does not help to decrease the regret on existing clients.
Moreover, the following theorem and corollary show that the difference between the models on the workers can be bounded.
Theorem 4.
For the online push-sum algorithm with step size , it holds that
where is the average model, namely,
Corollary 5.
If we set
the difference of the model on each worker admits a faster convergence rate than regret:
5 Experiments
We compare the performance of our proposed Online Push-Sum (OPS) method with that of the Decentralized Online Gradient method (DOL) and the Centralized Online Gradient method (COL), and then evaluate the effectiveness of OPS under different network sizes and network topology densities.
5.1 Implementation and Settings
We consider online logistic regression with squared $\ell_2$ norm regularization: , where is set to . is the stochastic component of the function , which is caused by the randomness of the data in the experiment (introduced in Section 3). We evaluate the learning performance by measuring the average loss , instead of using the dynamic regret directly, since the optimal reference point is the same for all the methods. The learning rate in Algorithm 1 is tuned to be optimal for each dataset separately. The experiment implementation is based on Python 3.7.0, PyTorch 1.2.0 (https://pytorch.org/), NetworkX 2.3 (https://networkx.github.io/), and scikit-learn 0.20.3 (https://scikit-learn.org). The source code of Online Push-Sum (OPS) is available at https://tinyurl.com/OnlinePushSum.
Dataset
Experiments were run on two real-world public datasets: SUSY (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#SUSY) and Room-Occupancy (https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+). SUSY and Room-Occupancy are both large-scale binary classification datasets, containing 5,000,000 and 20,566 samples, respectively. Each dataset is split into two subsets: the stochastic data and the adversarial data. The stochastic data is generated by allocating a fraction of the samples (e.g., 50% of the whole dataset) to nodes randomly and uniformly. The adversarial data is generated by clustering the remaining data and then allocating each cluster to a node. As we analyzed previously, only the scattered stochastic data can boost the model performance through communication between nodes. For each node, this pre-acquired data is transformed into streaming data to simulate online learning.
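For reference, the per-sample loss of regularized logistic regression used in this kind of experiment can be sketched as follows. This is an illustrative standard-library implementation, not the released PyTorch code; it assumes labels in {-1, +1}, and the parameter name `reg` and the value 0.1 in the usage lines are placeholders, not the coefficient used in the paper.

```python
import math


def logistic_loss_and_grad(x, features, label, reg):
    """Regularized logistic loss and its gradient for a single sample.

    loss = log(1 + exp(-label * <features, x>)) + 0.5 * reg * ||x||^2
    The label is assumed to be -1 or +1; `reg` is a hypothetical name for
    the regularization coefficient.
    """
    margin = label * sum(a * w for a, w in zip(features, x))

    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin > 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    loss += 0.5 * reg * sum(w * w for w in x)

    # sigmoid(-margin), computed without overflow for large |margin|.
    if margin >= 0:
        e = math.exp(-margin)
        sig = e / (1.0 + e)
    else:
        sig = 1.0 / (1.0 + math.exp(margin))

    # Gradient of the logistic term plus gradient of the regularizer.
    grad = [-label * a * sig + reg * w for a, w in zip(features, x)]
    return loss, grad


# At the zero model the loss is log(2) regardless of the sample.
loss0, grad0 = logistic_loss_and_grad([0.0, 0.0], [1.0, 2.0], 1, 0.1)
```

In an online run, each node would call such a function on its newly received sample and feed the returned gradient into its local update; averaging the returned losses over nodes and iterations gives the average loss reported in the figures.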
5.2 Compare OPS with DOL and COL
To compare OPS with DOL and COL, network sizes of 128 and 20 nodes are selected for SUSY and Room-Occupancy, respectively. For COL, the confusion matrix is fully connected (a doubly stochastic matrix). DOL and OPS are run with the same network topology and the same row stochastic matrix (an asymmetric confusion matrix) to maintain a fair comparison. This asymmetric confusion matrix is constructed by setting each node’s number of neighbors to a random value that is smaller than a fixed upper bound while also ensuring the strong connectivity of the whole network (the upper-bound neighbor number is set to 32 for the SUSY dataset and to 10 for the Room-Occupancy dataset). Since DOL typically requires a symmetric and doubly stochastic confusion matrix, DOL is run in two settings for comparison. In the first setting, in order to meet the assumption of symmetry and double stochasticity, all unidirectional connections are removed from the confusion matrix, so that the row stochastic confusion matrix degenerates into a doubly stochastic matrix. This setting is labeled DOL-Symm in Figure 2. In the second setting, DOL is forced to run on the asymmetric network, where each node naively aggregates its received models without considering whether its sending weights are equal to its receiving weights. This setting is labeled DOL-Asymm in Figure 2.
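One way such a topology could be generated is sketched below. This is a hypothetical construction rather than the procedure of the released code: it assumes a directed ring as a backbone to guarantee strong connectivity, adds random extra out-neighbors up to a bound, and then derives the mutual-edge subgraph that a DOL-Symm style baseline would be restricted to. The function names and the `seed` parameter are introduced here for illustration.

```python
import random


def random_directed_topology(n, max_out, seed=0):
    """Random directed graph with at most max_out out-neighbors per node.

    The ring i -> (i + 1) % n is always included, which guarantees strong
    connectivity; this backbone is an assumption of the sketch.
    """
    rng = random.Random(seed)
    out = []
    for i in range(n):
        nbrs = {(i + 1) % n}
        k = rng.randint(1, max_out)  # target number of out-neighbors
        candidates = [j for j in range(n) if j != i and j not in nbrs]
        nbrs.update(rng.sample(candidates, min(k - 1, len(candidates))))
        out.append(nbrs)
    return out


def is_strongly_connected(out):
    """Check that node 0 reaches every node and every node reaches node 0."""
    n = len(out)

    def reaches_all(adj):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n

    rev = [set() for _ in range(n)]
    for u in range(n):
        for v in out[u]:
            rev[v].add(u)
    return reaches_all(out) and reaches_all(rev)


def mutual_only(out):
    """Keep only edges that exist in both directions (DOL-Symm style pruning)."""
    return [{j for j in out[i] if i in out[j]} for i in range(len(out))]


out = random_directed_topology(20, 10, seed=0)
sym = mutual_only(out)
```

The pruned graph `sym` never has more edges than `out` and may even become disconnected, which is the drawback of mutual-trust-only communication noted in the introduction.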
As illustrated in Figure 2, on both datasets, OPS outperforms DOL-Symm with the row stochastic confusion matrix. This demonstrates that incorporating unidirectional communication can help to boost the model performance. In other words, OPS achieves better performance in the single-sided trust network under the federated learning setting. OPS also works better than DOL-Asymm. Although DOL-Asymm utilizes additional unidirectional connections, in some cases its performance is even worse than that of DOL-Symm (e.g., Figure 1(a)). This phenomenon is most likely attributable to its simple aggregation pattern, which degrades the performance of DOL-Asymm once the doubly stochastic matrix assumption is removed. These two observations confirm the effectiveness of OPS with a row stochastic confusion matrix, which is consistent with our theoretical analysis.
Comparing Figure 1(a) and Figure 1(b), we also observe that as the ratio of the stochastic component increases, the average loss (regret) becomes smaller. It is reasonable that OPS achieves slightly worse performance than COL, because OPS works in a sparsely connected network where much less information is exchanged than in COL. We use COL as the baseline in all experiments.
Only the number of iterations, rather than the actual running time, is considered in the experiments. Presenting the actual running time would be redundant: because the centralized method requires more time per iteration due to network congestion at the central node, OPS usually outperforms COL in terms of running time.
5.3 Evaluation on Different Network Size
Figure 3 summarizes the evaluation of OPS with different network sizes (for the SUSY dataset, sizes of 128, 256, 512, and 1024 are used, while for the smaller dataset, the network size is set to 10, 16, and 20). The upper-bound neighbor number is fixed to the same value across network sizes to isolate its impact. As we can see, on every dataset, the average loss (regret) curves for different network sizes are close to one another. These observations demonstrate that OPS is robust to the network size. Furthermore, the average loss (regret) is smaller for a larger network size (as shown in Figure 2(a), the curve of the network size is lower than others), which also demonstrates that more stochastic samples provided by more nodes can naturally accelerate the convergence.
5.4 Evaluation on Network Density
We also evaluate the performance of OPS under different network densities. We fix the network size to 512 and 20 for the SUSY and Room-Occupancy datasets, respectively. Network density is defined as the ratio of the upper-bound random neighbor number per node to the size of the network (e.g., a ratio of 0.5 on SUSY means that 256 is set as the upper-bound neighbor number for each node). We can see from Figure 4 that as the network density increases, the average loss (regret) decreases. This observation indicates that our proposed OPS algorithm works well under different network densities and gains more from a denser row stochastic matrix. This benefit can also be understood intuitively: in a federated learning network, a user’s model performance will improve if it communicates with more users.
6 Conclusions
Decentralized federated learning with single-sided trust is a promising framework for solving a wide range of problems. In this paper, the online push-sum algorithm is developed for this setting; it can handle complex network topologies and is proven to have an optimal convergence rate. The regret-based online problem formulation also extends its applications. We tested the proposed OPS algorithm in various experiments, which empirically support its efficiency.
References
Privacy-preserving deep learning via additively homomorphic encryption. IEEE Transactions on Information Forensics and Security 13 (5), pp. 1333–1345.
Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792.
Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950.
Personalized and private peer-to-peer machine learning. In International Conference on Artificial Intelligence and Statistics, pp. 473–481.
Federated Kernelized Multi-Task Learning. The Conference on Systems and Machine Learning, pp. 3.
Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606.
When does a digraph admit a doubly stochastic adjacency matrix?. In Proceedings of the 2010 American Control Conference, pp. 2440–2445.
Introduction to online convex optimization. Foundations and Trends® in Optimization 2 (3-4), pp. 157–325.
COLA: decentralized linear learning. In Advances in Neural Information Processing Systems, pp. 4541–4551.
Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 3068–3076.
Communication-efficient distributed online prediction by dynamic model synchronization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 623–639.
Federated Optimization: Distributed Optimization Beyond the Datacenter. arXiv:1511.03575 [cs, math].
Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
Federated Learning: Strategies for Improving Communication Efficiency. arXiv:1610.05492 [cs].
Coordinate dual averaging for decentralized online optimization with nonseparable global objectives. IEEE Transactions on Control of Network Systems 5 (1), pp. 34–44.
Pipe-SGD: a decentralized pipelined SGD framework for distributed deep net training. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8056–8067.
Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340.
Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning.
Don’t Use Large Mini-Batches, Use Local SGD. arXiv:1808.07217 [cs, stat].
Adding vs. Averaging in Distributed Primal-Dual Optimization. arXiv:1502.03508 [cs].
Google AI Blog: Federated Learning: Collaborative Machine Learning without Centralized Training Data.
Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv:1602.05629 [cs].
Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control 60 (3), pp. 601–615.
Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control 61 (12), pp. 3936–3947.
Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications 147 (3), pp. 516–545.
Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control 63 (3), pp. 714–725.
Online learning and online convex optimization. Foundations and Trends® in Machine Learning 4 (2), pp. 107–194.
Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research 14 (Feb), pp. 567–599.
Towards more efficient stochastic decentralized learning: faster convergence and sparse communication. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 4624–4633.
Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321.
Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434.
CoCoA: A General Framework for Communication-Efficient Distributed Optimization. arXiv:1611.02189 [cs].
Local SGD Converges Fast and Communicates Little.
Push-sum distributed dual averaging for convex optimization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 5453–5458.
Distributed dual averaging for convex optimization under communication delays. In 2012 American Control Conference (ACC), pp. 1067–1072.
Decentralized collaborative learning of personalized models over networks. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Cooperative SGD: A unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms.
Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks PP, pp. 1–1.
Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 12.
Analysis of distributed stochastic dual coordinate ascent. arXiv preprint arXiv:1312.1031.
Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 629–637.
Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning. arXiv:1807.06629 [cs, math].
Decentralized online learning: take benefits from others’ data without sharing your own to track global trend. arXiv preprint arXiv:1901.10593.
Supplementary Material
Notations:
7 Proof of Theorem 2 and Corollary 3
Proof.
Since the loss function is assumed to be convex, we have
For , we have
Notice that for COL, we have because . So for DOL, in order to bound , we need to bound the difference (using Lemma 9).
Summing up the inequality above from to , we get
Choosing , we have
So we have
Notice that Corollary 3 can be easily verified by setting . ∎
Next, we present two lemmas used in the proof of Lemma 9. Proofs of these two lemmas can be found in the existing literature (Nedić and Olshevsky, 2014, 2016; Assran and Rabbat, 2018; Assran et al., 2018).
Lemma 6.
Under Assumption 1, there exists a constant such that for any , the following holds
(3) 
where is a row stochastic matrix.
Lemma 7.
Under Assumption 1, for any , there exist a stochastic vector and two constants and such that, for any satisfying , the following inequality holds
where is a row stochastic matrix, and is a vector with being its th entry.
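The geometric mixing behind bounds of the kind stated in Lemma 7 can be illustrated numerically. The following is a minimal sketch, not the setting of the lemma itself: it assumes a fixed, hypothetical 3-node row-stochastic matrix `W` (strongly connected, with self-loops) instead of the time-varying products covered by the lemma, and the helper names `matmul` and `row_spread` are introduced here only for illustration. It tracks how quickly the rows of the powers of `W` approach a common stochastic vector.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def row_spread(m):
    """Largest gap between any two entries that share a column."""
    n = len(m)
    return max(max(m[i][j] for i in range(n)) - min(m[i][j] for i in range(n))
               for j in range(n))


# Hypothetical mixing matrix (an assumption, not taken from the paper):
# every row sums to 1, while the columns do not, so it is row stochastic
# but not doubly stochastic.
W = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.3, 0.3, 0.4]]

power = W
spreads = [row_spread(power)]
for _ in range(49):
    # Each row of W @ power is a convex combination of the rows of power,
    # so the rows are pulled toward one another at every step.
    power = matmul(W, power)
    spreads.append(row_spread(power))

print("spread after 1, 10, 50 products:", spreads[0], spreads[9], spreads[-1])
```

In this example the spread never increases and shrinks toward zero, so all rows of the product approach the same stochastic vector; the lemma asserts a bound of this flavor for the matrix products arising in the analysis.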
Lemma 8.
Given two nonnegative sequences and that satisfy
(4) 
with , we have
Proof.
From the definition, we have
(5)  
∎
Based on the above three lemmas, we can obtain the following lemma.
Lemma 9.
Proof.
The updating rule of OPS can be formulated as
where is a row stochastic matrix, is the matrix whose columns are , is the gradient matrix, whose columns are the stochastic gradients at on node , and is the matrix whose columns are .
Assuming and , we have
(6)  
(7)  
(8) 
where is the average of all variables on the nodes, and is the averaged gradient. We have since is a row stochastic matrix.
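The role of row stochasticity in the averaged recursion can be checked on a small example. This is a sketch under stated assumptions: the update is taken to have the form `Z_next = (Z - gamma * G) W` with one column per node, which may differ in detail from the OPS update above, and the matrices `W`, `Z`, `G`, the step size `gamma`, and the helpers `mix` and `column_mean` are all hypothetical. Because every row of `W` sums to one, multiplying by `W` on the right leaves the average over nodes unchanged, so the average follows a plain gradient step.

```python
def mix(z, w):
    """Return z @ w, where z is d x n (one column per node) and w is n x n."""
    d, n = len(z), len(w)
    return [[sum(z[r][k] * w[k][j] for k in range(n)) for j in range(n)]
            for r in range(d)]


def column_mean(z):
    """Average the columns of z, i.e. average over the nodes."""
    return [sum(row) / len(row) for row in z]


gamma = 0.1  # hypothetical step size

# Hypothetical row-stochastic mixing matrix: rows sum to 1, columns do not.
W = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.3, 0.3, 0.4]]

# Hypothetical iterates and stochastic gradients: d = 2 coordinates, n = 3 nodes.
Z = [[1.0, -2.0, 4.0],
     [0.5, 3.0, -1.0]]
G = [[0.2, -0.4, 1.0],
     [1.5, 0.0, -0.5]]

# Assumed update: local gradient step on every node, then mixing through W.
half_step = [[Z[r][j] - gamma * G[r][j] for j in range(3)] for r in range(2)]
Z_next = mix(half_step, W)

# The node average should satisfy z_bar_next = z_bar - gamma * g_bar.
z_bar, g_bar = column_mean(Z), column_mean(G)
expected = [z_bar[r] - gamma * g_bar[r] for r in range(2)]
print(column_mean(Z_next), expected)
```

The two printed vectors agree up to floating-point error even though `W` is not doubly stochastic; only the row sums matter for this identity.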
For , according to Lemma 7, we decompose it as follows