Central Server Free Federated Learning over Single-sided Trust Social Networks

10/11/2019 · Chaoyang He et al. · University of Michigan, University of Southern California, University of Rochester

Federated learning has become increasingly important for modern machine learning, especially for data privacy-sensitive scenarios. Existing federated learning mostly adopts a central server-based (i.e., centralized) architecture. However, in many social network scenarios, centralized federated learning is not applicable (e.g., a central agent or server connecting all users may not exist, or the communication cost to the central server is not affordable). In this paper, we consider a generic setting: 1) the central server may not exist, and 2) the social network is unidirectional, i.e., of single-sided trust (user A trusts user B, but user B may not trust user A). We propose a central server free federated learning algorithm, named the Online Push-Sum (OPS) method, to handle this challenging but generic scenario. A rigorous regret analysis is also provided, which shows interesting results on how users can benefit from communication with trusted users in the federated learning scenario. This work establishes a fundamental algorithmic framework and theoretical guarantees for federated learning in the generic social network scenario.


1 Introduction

Federated learning has been well recognized as a framework able to protect data privacy (Konečný et al., 2016; Smith et al., 2017b; Yang et al., 2019). State-of-the-art federated learning adopts a centralized network architecture, in which a central node collects the gradients sent from child agents to update the global model. Despite its simplicity, the centralized method suffers from communication and computational bottlenecks at the central node, especially for federated learning, where a large number of clients are usually involved. Moreover, to prevent reverse engineering of users' identities, a certain amount of noise must be added to the gradients to protect user privacy, which partially sacrifices efficiency and accuracy (Shokri and Shmatikov, 2015).

(a) Centralized
(b) Decentralized with mutual trust
(c) Decentralized with single-sided trust
Figure 1: Different types of architectures.

To further protect data privacy and avoid the communication bottleneck, the decentralized architecture has recently been proposed (Vanhaesebrouck et al., 2017; Bellet et al., 2018), in which the central node is removed and each node communicates only with its neighbors (with mutual trust) by exchanging local models. Exchanging local models is usually favored over sending private gradients with respect to data privacy protection, because a local model is an aggregation (or mixture) of a large amount of data, while a local gradient directly reflects only one or a batch of private data samples. Although the advantages of the decentralized architecture over its centralized counterpart are well recognized, it can usually only run on networks with mutual trust. That is, two nodes (or users) can exchange their local models only if they trust each other reciprocally (e.g., node A may trust node B, but if node B does not trust node A, they cannot communicate). Given a social network, one can only use the edges with mutual trust to run decentralized federated learning algorithms. Two immediate drawbacks are:

  • If the mutual-trust edges do not form a connected network, decentralized federated learning does not apply;

  • Removing all single-sided edges from the communication network could significantly reduce the efficiency of communication.

This leads to the question: how do we effectively utilize the single-sided trust edges under the decentralized federated learning framework?

In this paper, we consider the social network scenario, where a centralized network is unavailable (e.g., there does not exist a central node that can build up connections with all users, or the centralized communication cost is not affordable). We make only minimal assumptions about the social network:

  • The data may come in a streaming fashion on each user node as the federated learning algorithm runs;

  • The trust between users may be single-sided, where user A trusts user B, but user B may not trust user A (“trust” means “would like to send information to”).

For the aforementioned setting, we develop a decentralized learning algorithm called online push-sum (OPS), which possesses the following features:

  • Only models, rather than local gradients, are exchanged among clients in our algorithm. This reduces the risk of exposing clients' data privacy (Aono et al., 2017).

  • Our algorithm removes some constraints imposed by typical decentralized methods, which makes it more flexible in allowing arbitrary network topology. Each node only needs to know its out neighbors instead of the global topology.

  • We provide a rigorous regret analysis for the proposed algorithm and specifically distinguish two components in the online loss function: the adversarial component and the stochastic component, which model clients' private data and the internal connections between clients, respectively.

Notation

We adopt the following notation in this paper:

  • For a random variable $\xi_t^i$ subject to distribution $\mathcal{D}_t^i$, we use $\boldsymbol{\xi}_t = \{\xi_t^i\}_{i=1}^n$ and $\boldsymbol{\mathcal{D}}_t = \{\mathcal{D}_t^i\}_{i=1}^n$ to denote the corresponding sets of random variables and distributions, respectively. The notation $\boldsymbol{\xi}_t \sim \boldsymbol{\mathcal{D}}_t$ implies $\xi_t^i \sim \mathcal{D}_t^i$ for any $i$ and $t$.

  • For a decentralized network with $n$ nodes, we use $W \in \mathbb{R}^{n \times n}$ to denote the confusion matrix, where $W_{ij}$ is the weight that node $i$ sends to node $j$. $N_{\text{in}}(i)$ and $N_{\text{out}}(i)$ denote the sets of in-neighbors and out-neighbors of node $i$, respectively.

  • The norm $\|\cdot\|$ denotes the $\ell_2$ norm by default.

2 Related Work

The concept of federated learning was first proposed by McMahan et al. (2016), who advocate a novel learning setting that learns a shared model by aggregating locally-computed gradient updates without centralizing distributed data on devices. Early examples of research into federated learning also include Konečný et al. (2015, 2016) and a widespread blog article posted by Google AI (McMahan and Ramage, 2017). To address both statistical and system challenges, Smith et al. (2017a) and Caldas et al. (2018) propose a multi-task learning framework for federated learning and its related optimization algorithm, which extends the early works SDCA (Shalev-Shwartz and Zhang, 2013; Yang, 2013; Yang et al., 2013) and COCOA (Jaggi et al., 2014; Ma et al., 2015; Smith et al., 2016) to the federated learning setting. Among these optimization methods, Federated Averaging (FedAvg), proposed by McMahan et al. (2016), beats conventional synchronized mini-batch SGD in terms of communication rounds and also converges on non-IID and unbalanced data. Recent rigorous theoretical analysis (Stich, 2018; Wang and Joshi, 2018; Yu et al., 2018; Lin et al., 2018) shows that FedAvg is a special case of periodic averaging SGD (also called "local SGD"), which allows nodes to perform local updates with infrequent synchronization, communicating less while converging quickly. All these previous works on federated learning consider system constraints on privacy or communication and computation cost. However, they cannot be applied to the single-sided trust network (asymmetric topology matrix).

Decentralized learning is a typical parallel strategy in which each worker only needs to communicate with its neighbors, which means the communication bottleneck (at the parameter server) is removed. It has already been proved that decentralized learning can outperform traditional centralized learning when the number of workers is large and the network condition is poor (Lian et al., 2017).

There are two main types of decentralized learning algorithms: those with a fixed network topology (He et al., 2018), and those whose topology is time-varying (Nedić and Olshevsky, 2015; Lian et al., 2018) during training. Wu et al. (2017) and Shen et al. (2018) show that decentralized SGD converges at a rate comparable to the centralized algorithm while requiring less communication, making large-scale model training feasible. Li et al. (2018) provide a systematic analysis of the decentralized learning pipeline.

Online learning has been studied for decades. It is well known that the lower bounds of online optimization methods are $\Omega(\sqrt{T})$ and $\Omega(\log T)$ for convex and strongly convex loss functions, respectively (Hazan et al., 2016; Shalev-Shwartz et al., 2012). In recent years, due to the increasing volume of data, distributed online learning, especially its decentralized variants, has attracted much attention. Examples of such work include Kamp et al. (2014), Shahrampour and Jadbabaie (2017), and Lee et al. (2016). Notably, Zhao et al. (2019) shares a similar problem definition and theoretical results with our paper. However, single-sided communication is not allowed in their setting, making their results more restrictive.

3 Problem Setting

In this paper, we consider federated learning with $n$ clients (a.k.a. nodes). Each client can be either an edge server or some other kind of computing device, such as a smartphone, which has local private data and a local machine learning model stored on it. We assume the topological structure of the network of these nodes can be represented by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with vertex set $\mathcal{V} = \{1, \dots, n\}$ and edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. If there exists an edge $(i, j) \in \mathcal{E}$, nodes $i$ and $j$ have a network connection and $i$ can directly send messages to $j$.

Let $x_t^i$ denote the local model on the $i$-th node at iteration $t$. In each iteration, node $i$ receives a new sample and computes a prediction for this new sample according to the current model $x_t^i$ (e.g., it may recommend some items to the user in an online recommendation system). After that, a loss function $f_t^i(\cdot)$ associated with that new sample is revealed to node $i$. The typical goal of online learning is to minimize the regret, which is defined as the difference between the summation of the losses incurred by the nodes' predictions and the corresponding loss of the global optimal model $x^\star$:

$$\mathrm{Reg}_T := \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x_t^i) - \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x^\star),$$

where $x^\star = \arg\min_{x} \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x)$.

However, here we consider a more general online setting: the loss function of the $i$-th node at iteration $t$ is $f_t^i(x; \xi_t^i)$, additionally parametrized by a random variable $\xi_t^i$. Each $\xi_t^i$ is drawn from a distribution $\mathcal{D}_t^i$, mutually independently in terms of $i$ and $t$, and we call this part the stochastic component of the loss function $f_t^i$. The stochastic component can be utilized to characterize the internal randomness of nodes' data and the potential connections among different nodes. For example, music preference may be impacted by popular trends on the Internet, which can be formulated in our model by letting $\mathcal{D}_t^i = \mathcal{D}_t$ for all $i$ with some time-varying distribution $\mathcal{D}_t$. On the other hand, the function $f_t^i$ itself is the adversarial component of the loss, which may encode, for example, the user's profile, location, etc. Therefore, the objective regret naturally becomes the expectation over all the past losses:

$$\mathrm{Reg}_T := \mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x_t^i; \xi_t^i) - \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x^\star; \xi_t^i)\right],$$

with $x^\star = \arg\min_{x} \mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x; \xi_t^i)\right]$.

One benefit of the above formulation is that it partially solves the non-I.I.D. issue in federated learning. A fundamental assumption in many traditional distributed machine learning methods is that the data samples stored on all nodes are I.I.D., which fails to hold for federated learning since the data on each user’s device is highly correlated to the user’s preferences and habits. However, our formulation does not require the I.I.D. assumption to hold for the adversarial component at all. Even though the random samples for the stochastic component still need to be independent, they are allowed to be drawn from different distributions.

Finally, one should note that online optimization also includes stochastic optimization (i.e., data samples are drawn from a fixed distribution) and offline optimization (i.e., data are already collected before optimization begins) as its special cases (Shalev-Shwartz and others, 2012). Hence, our problem setting actually covers a wide range of applications.

4 Online Push-Sum Algorithm

In this section, we describe the construction of the confusion matrix and introduce the proposed algorithm.

4.1 Construction of Confusion Matrix

One important parameter of the algorithm is the confusion matrix $W \in \mathbb{R}^{n \times n}$, which depends on the network topology $\mathcal{G}$: $W_{ij} = 0$ whenever there is no directed edge $(i, j)$ in $\mathcal{E}$. If the value of $W_{ij}$ is large, node $i$ has a strong impact on node $j$. However, $W$ still allows flexibility: users can specify the weights associated with existing edges, meaning that even if there is a physical connection between two nodes, a node can decide against using the channel. For example, even if $(i, j) \in \mathcal{E}$, user $i$ can still set $W_{ij} = 0$ if user $i$ thinks node $j$ is not trustworthy and therefore chooses to exclude the channel from $i$ to $j$.

Of course, there are still some constraints on $W$: $W$ must be a row stochastic matrix (i.e., each entry in $W$ is non-negative and each row sums to 1). This assumption is different from the one in classical decentralized distributed optimization, which typically assumes $W$ is symmetric and doubly stochastic (e.g., Duchi et al. 2011), i.e., both the rows and the columns all sum to 1. Such a requirement is quite restrictive, because not all networks admit a doubly stochastic matrix (Gharesifard and Cortés, 2010), and relinquishing double stochasticity can introduce bias in optimization (Ram et al., 2010; Tsianos and Rabbat, 2012). As a comparison, our assumption that $W$ is row stochastic avoids such concerns, since any non-negative matrix with at least one positive entry in each row (which is already implied by the connectivity of the graph) can easily be normalized into a row stochastic matrix. The relaxation of this assumption is crucial for federated learning, considering that a federated learning system usually involves complex network topology due to its large number of clients. Moreover, since each node only needs to make sure the summation of its out-weights is 1, no node needs to be aware of the global network topology, which greatly benefits the implementation of the federated system. On the other hand, requiring $W$ to be symmetric rules out the possibility of using asymmetric network topology and adopting single-sided trust, while our method has no such restriction.
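As a concrete illustration, the following minimal sketch (ours, not the authors' code) builds a row stochastic confusion matrix from a directed trust graph by giving every node a self-loop and normalizing its out-weights uniformly; the function name and the uniform weighting are illustrative assumptions.

import numpy as np

def row_stochastic_confusion_matrix(adjacency):
    # adjacency[i, j] == 1 iff node i trusts (sends to) node j.
    W = np.array(adjacency, dtype=float)
    np.fill_diagonal(W, 1.0)               # every node keeps a self-loop
    W /= W.sum(axis=1, keepdims=True)      # normalize each row to sum to 1
    return W

Note that the normalization of row $i$ involves only node $i$'s own out-edges, so each node can compute its row of $W$ locally, without any knowledge of the global topology.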

1: Input: learning rate $\eta$, number of iterations $T$, and the confusion matrix $W$.
2: Initialize $z_0^i = x_0^i = 0$ and $\omega_0^i = 1$ for all $i$
3: for $t = 0, 1, \dots, T-1$ do
4:      For all users in parallel (say the $i$-th node):
5:      Apply the local model $x_t^i$ and suffer the loss $f_t^i(x_t^i; \xi_t^i)$
6:      Locally compute the intermediate variable $z_{t+1/2}^i = z_t^i - \eta \nabla f_t^i(x_t^i; \xi_t^i)$
7:      Send $\big(W_{ij} z_{t+1/2}^i,\ W_{ij}\,\omega_t^i\big)$ to all $j \in N_{\text{out}}(i)$
8:      Update $z_{t+1}^i = \sum_{j \in N_{\text{in}}(i)} W_{ji}\, z_{t+1/2}^j$, $\ \omega_{t+1}^i = \sum_{j \in N_{\text{in}}(i)} W_{ji}\,\omega_t^j$, $\ x_{t+1}^i = z_{t+1}^i / \omega_{t+1}^i$
9: end for
Algorithm 1 Online Push-Sum (OPS) Algorithm

4.2 Algorithm Description

The proposed online push-sum algorithm is presented in Algorithm 1. The algorithm design mainly follows the pattern of the push-sum algorithm (Tsianos et al., 2012), but we further generalize it to the online setting.

The algorithm mainly consists of three steps:

  1. Local update: each client applies the current local model to obtain the loss function, based on which an intermediate local model $z_{t+1/2}^i$ is computed;

  2. Push: the weighted variables $W_{ij} z_{t+1/2}^i$ and $W_{ij}\,\omega_t^i$ are sent to all of node $i$'s out-neighbors $j \in N_{\text{out}}(i)$;

  3. Sum: all received quantities are summed and normalized to obtain the new local model $x_{t+1}^i = z_{t+1}^i / \omega_{t+1}^i$.

It should be noted that auxiliary variables $z_t^i$ and $z_{t+1/2}^i$ are used in the algorithm. They serve to clarify the description but may easily be removed in a practical implementation. Besides, another variable $\omega_t^i$ is introduced, which is the normalizing factor of $z_t^i$. $\omega_t^i$ plays an important role in the push-sum algorithm: since $W$ is not doubly stochastic in our setting, the total weight a node receives may not equal 1. The introduction of the normalizing factor helps the algorithm avoid the issues brought about by a $W$ that is not doubly stochastic. Furthermore, when $W$ is doubly stochastic, it can easily be verified that $\omega_t^i = 1$ and $x_t^i = z_t^i$ for any $i$ and $t$, and Algorithm 1 reduces to the distributed online gradient method proposed by Zhao et al. (2019).

In the algorithm, the local data, which is encoded in the gradient (Shokri and Shmatikov, 2015), is only utilized in updating the local model. What neighboring nodes exchange is limited to the local models. Exchanging models instead of local data reduces the risk of leaking users' privacy.
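To make the update concrete, the following sketch (ours) simulates one OPS round in matrix form with NumPy, where column $i$ of each matrix holds node $i$'s variables and `grads` is a caller-supplied function returning the per-node stochastic gradients; in a real deployment, each node would compute only its own column and exchange messages with its out-neighbors, as in Algorithm 1.

import numpy as np

def ops_round(Z, w, X, W, grads, eta):
    # Z: d x n unnormalized models, w: length-n normalizing weights,
    # X: d x n local models, W: n x n row stochastic confusion matrix.
    Z_half = Z - eta * grads(X)   # local update: gradient step on each node
    Z_new = Z_half @ W            # push: node i sends W[i, j] * z_i to node j
    w_new = w @ W                 # push the normalizing weights the same way
    X_new = Z_new / w_new         # sum and normalize: x_j = z_j / w_j
    return Z_new, w_new, X_new

Starting from Z = X = 0 and w = 1 and iterating this round reproduces Algorithm 1. When $W$ is doubly stochastic, `w_new` stays equal to the all-ones vector, the normalization becomes a no-op, and the round coincides with the distributed online gradient method, matching the reduction discussed above.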

4.3 Regret Analysis

In this subsection, we provide a regret bound analysis of the OPS algorithm. Due to space limitations, the detailed proofs are deferred to the supplementary material.

To carry out the analysis, the following assumptions are required:

Assumption 1.

We make the following assumptions throughout this paper:

  • The topological graph $\mathcal{G}$ is strongly connected.

  • $W$ is row stochastic.

  • For any $i$, $t$, and $\xi$, the loss function $f_t^i(\cdot; \xi)$ is convex.

  • The norm of the expected gradient is bounded, i.e., there exists a constant $G$ such that $\left\|\mathbb{E}_{\xi}\nabla f_t^i(x; \xi)\right\| \le G$ for any $x$, $i$, and $t$.

  • The gradient variance is also bounded by $\sigma^2$, namely, $\mathbb{E}_{\xi}\left\|\nabla f_t^i(x; \xi) - \mathbb{E}_{\xi}\nabla f_t^i(x; \xi)\right\|^2 \le \sigma^2$.

  • The problem domain is bounded, such that for any two vectors $x$ and $y$ we always have $\|x - y\| \le R$.

Here $G$ provides an upper bound for the adversarial component, while $\sigma$ measures the magnitude of the stochasticity brought by the stochastic component. When $\sigma = 0$, the problem setting simply reduces to normal distributed online learning. As for the convexity and domain boundedness assumptions, they are quite common in the online learning literature, such as Hazan et al. (2016).

Equipped with these assumptions, now we are ready to present our main theorem:

Theorem 2.

For the online push-sum algorithm with step size $\eta$, it holds that

(1)

where the constants appearing in the bound are defined in the appendix.

By choosing an optimal step size , we can obtain the following corollary:

Corollary 3.

If we set the step size $\eta$ optimally (as specified in the appendix), the regret of OPS can be bounded by:

(2)

Note that when $n = 1$ and $\sigma = 0$, where the problem setting reduces to normal online optimization, the implied $O(\sqrt{T})$ regret bound exactly matches the lower bound for online optimization (Hazan et al., 2016). Moreover, our result also matches the convergence rate of centralized online learning, which corresponds to the fully connected network. Hence, we can conclude that the OPS algorithm has an optimal dependence on $T$.

This bound has a linear dependence on the number of nodes $n$, which is easy to understand. First, we have defined the regret as the summation of the losses over all nodes, so increasing $n$ naturally makes the regret larger. Second, our federated learning setting differs from typical distributed learning in that the I.I.D. assumption does not hold: each node contains distinct local data, which may be drawn from totally different distributions. Therefore, adding more nodes does not help decrease the regret on existing clients.

Moreover, we also prove that the difference between the models on different workers can be bounded, using the following theorem and corollary.

Theorem 4.

For the online push-sum algorithm with step size $\eta$, it holds that

where $\overline{x}_t$ is the average model, namely, $\overline{x}_t = \frac{1}{n}\sum_{i=1}^{n} x_t^i$.

The proof of Theorem 4 can be found in Lemma 9.

Corollary 5.

If we set the step size $\eta$ appropriately, the difference between the model on each worker and the average model admits a faster convergence rate than the regret:

Figure 2: Comparison of OPS with DOL (Decentralized Online Learning) and COL (Centralized Online Learning). Panels: (a) stochastic=100%, #Neighbors=32; (b) stochastic=50%, #Neighbors=32; (c) stochastic=100%, #Neighbors=10; (d) stochastic=50%, #Neighbors=10.

5 Experiments

We compare the performance of our proposed Online Push-Sum (OPS) method with that of the Decentralized Online Gradient method (DOL) and the Centralized Online Gradient method (COL), and then evaluate the effectiveness of OPS under different network sizes and network topology densities.

5.1 Implementation and Settings

We consider online logistic regression with squared $\ell_2$ norm regularization: $f_t^i(x; \xi) = \log\left(1 + \exp(-y\, x^\top a)\right) + \frac{\lambda}{2}\|x\|^2$, where $(a, y)$ is the received sample and the regularization coefficient $\lambda$ is fixed across all methods. The randomness of the data in the experiment induces the stochastic component of the function $f_t^i$ (introduced in Section 3). We evaluate the learning performance by measuring the average loss $\frac{1}{nT}\sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x_t^i)$ instead of using the dynamic regret directly, since the optimal reference point is the same for all methods. The learning rate $\eta$ in Algorithm 1 is tuned to be optimal for each dataset separately. The implementation is based on Python 3.7.0, PyTorch 1.2.0 (https://pytorch.org/), NetworkX 2.3 (https://networkx.github.io/), and scikit-learn 0.20.3 (https://scikit-learn.org). The source code of Online Push-Sum (OPS) is available at https://tinyurl.com/Online-Push-Sum.
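For reference, the following is a minimal sketch (ours) of the per-sample objective described above, assuming labels $y \in \{-1, +1\}$; the variable `lam` stands in for the regularization coefficient, whose exact experimental value is not reproduced here.

import numpy as np

def logistic_loss_and_grad(x, a, y, lam):
    # Regularized logistic loss and gradient for one streaming sample (a, y).
    margin = y * a.dot(x)
    loss = np.log1p(np.exp(-margin)) + 0.5 * lam * x.dot(x)
    grad = -y * a / (1.0 + np.exp(margin)) + lam * x
    return loss, grad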

Dataset

Experiments were run on two real-world public datasets: SUSY (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#SUSY) and Room-Occupancy (https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+). SUSY and Room-Occupancy are both large-scale binary classification datasets, containing 5,000,000 and 20,566 samples, respectively. Each dataset is split into two subsets: the stochastic data and the adversarial data. The stochastic data is generated by allocating a fraction of samples (e.g., 50% of the whole dataset) to nodes randomly and uniformly. The adversarial data is generated by clustering the remaining samples into groups and then allocating every cluster to a node. As we analyzed previously, only the scattered stochastic data can boost model performance through inter-node communication. For each node, this pre-acquired data is transformed into streaming data to simulate online learning.
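The following sketch (ours) illustrates this split; the paper's exposition does not name the clustering algorithm here, so k-means is assumed for illustration, and the helper name is hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def split_indices(features, n_nodes, stochastic_ratio, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    n_stoch = int(stochastic_ratio * len(features))
    stoch_idx, adv_idx = idx[:n_stoch], idx[n_stoch:]
    # Stochastic part: spread samples over nodes uniformly at random.
    node_idx = {i: list(stoch_idx[i::n_nodes]) for i in range(n_nodes)}
    # Adversarial part: cluster the rest and hand each node one whole cluster.
    labels = KMeans(n_clusters=n_nodes, random_state=seed).fit_predict(features[adv_idx])
    for i in range(n_nodes):
        node_idx[i] += list(adv_idx[labels == i])
    return node_idx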

5.2 Compare OPS with DOL and COL

To compare OPS with DOL and COL, network sizes of 128 nodes and 20 nodes are selected for SUSY and Room-Occupancy, respectively. For COL, the confusion matrix is fully connected (a doubly stochastic matrix). DOL and OPS are run with the same network topology and the same row stochastic matrix (an asymmetric confusion matrix) to maintain a fair comparison. This asymmetric confusion matrix is constructed by setting each node's number of neighbors to a random value smaller than a fixed upper bound, while ensuring the strong connectivity of the whole network (the upper bound on the number of neighbors is set to 32 for the SUSY dataset and 10 for the Room-Occupancy dataset). Since DOL typically requires the confusion matrix to be symmetric and doubly stochastic, DOL is run in two settings for comparison. In the first setting, in order to meet the assumption of symmetry and double stochasticity, all unidirectional connections are removed from the confusion matrix, so that the row stochastic confusion matrix degenerates into a doubly stochastic matrix. This setting is labeled DOL-Symm in Figure 2. In the other setting, DOL is forced to run on the asymmetric network, where each node naively aggregates its received models without considering whether its sending weights equal its receiving weights. This setting is labeled DOL-Asymm in Figure 2.
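The sketch below (ours) shows one way to generate such a topology with NetworkX under the stated constraints: a directed cycle guarantees strong connectivity, and each node then draws a random number of additional out-edges below the fixed upper bound (assumed to be at most the network size). The authors' exact construction may differ.

import networkx as nx
import numpy as np

def random_strongly_connected_digraph(n, max_out, seed=0):
    rng = np.random.default_rng(seed)
    G = nx.DiGraph()
    G.add_edges_from((i, (i + 1) % n) for i in range(n))  # cycle => strong connectivity
    for i in range(n):
        extra = rng.integers(0, max_out)                  # random out-degree below the bound
        for j in rng.choice(n, size=extra, replace=False):
            if j != i:
                G.add_edge(i, int(j))
    assert nx.is_strongly_connected(G)
    return G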

As illustrated in Figure 2, on both datasets, OPS outperforms DOL-Symm under the row stochastic confusion matrix. This demonstrates that incorporating unidirectional communication helps boost model performance. In other words, OPS gains better performance on single-sided trust networks under the federated learning setting. OPS also works better than DOL-Asymm. Although DOL-Asymm utilizes the additional unidirectional connections, in some cases its performance is even worse than that of DOL-Symm (e.g., Figure 2(a)). This phenomenon is most likely attributable to its naive aggregation pattern, which degrades DOL-Asymm's performance once the doubly stochastic matrix assumption is removed. These two observations confirm the effectiveness of OPS with a row stochastic confusion matrix, which is consistent with our theoretical analysis.

Comparing Figure 2(a) and Figure 2(b), we also observe that increasing the ratio of the stochastic component makes the average loss (regret) smaller. It is reasonable that OPS achieves slightly worse performance than COL, because OPS works on a sparsely connected network where much less information is exchanged than in COL. We use COL as the baseline in all experiments.

Only the number of iterations, rather than the actual running time, is considered in the experiments. Presenting the actual running time would be redundant: because the centralized method requires more time per iteration due to network congestion at the central node, OPS usually outperforms COL in terms of running time.

5.3 Evaluation on Different Network Size

Figure 3: Evaluation on different network sizes. Panels: (a) stochastic=100%, #Neighbors=64; (b) stochastic=50%, #Neighbors=64; (c) stochastic=100%, #Neighbors=2; (d) stochastic=50%, #Neighbors=2.

Figure 4: Evaluation on network density. Panels: (a) stochastic=100%; (b) stochastic=50%; (c) stochastic=100%; (d) stochastic=50%.

Figure 3 summarizes the evaluation of OPS under different network sizes (128, 256, 512, and 1024 for the SUSY dataset, and 10, 16, and 20 for the smaller Room-Occupancy dataset). The upper bound on the number of neighbors is kept the same across different network sizes to isolate its impact. As we can see, for each dataset, the average loss (regret) curves for different network sizes remain close to one another, which demonstrates that OPS is robust to the network size. Furthermore, the average loss (regret) is smaller for larger network sizes (as shown in Figure 3(a), the curve for the largest network size is lower than the others), which also demonstrates that the additional stochastic samples provided by more nodes naturally accelerate convergence.

5.4 Evaluation on Network Density

We also evaluate the performance of OPS under different network densities. We fix the network size to 512 and 20 for the SUSY and Room-Occupancy datasets, respectively. Network density is defined as the ratio of the upper-bound random neighbor number per node to the size of the network (e.g., a ratio of 0.5 on SUSY means 256 is set as the upper-bound neighbor number for each node). We can see from Figure 4 that as the network density increases, the average loss (regret) decreases. This observation shows that our proposed OPS algorithm works well across different network densities and gains more benefit from a denser row stochastic matrix. This benefit can also be understood intuitively: in a federated learning network, a user's model performance improves if it communicates with more users.

6 Conclusions

Decentralized federated learning with single-sided trust is a promising framework for solving a wide range of problems. In this paper, the online push-sum algorithm is developed for this setting; it is able to handle complex network topologies and is proven to have an optimal convergence rate. The regret-based online problem formulation also extends its range of applications. We tested the proposed OPS algorithm in various experiments, which empirically justify its efficiency.

References

  • Y. Aono, T. Hayashi, L. Wang, S. Moriai, et al. (2017) Privacy-preserving deep learning via additively homomorphic encryption. IEEE Transactions on Information Forensics and Security 13 (5), pp. 1333–1345.
  • M. Assran, N. Loizou, N. Ballas, and M. Rabbat (2018) Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792.
  • M. Assran and M. Rabbat (2018) Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950.
  • A. Bellet, R. Guerraoui, M. Taziki, and M. Tommasi (2018) Personalized and private peer-to-peer machine learning. In International Conference on Artificial Intelligence and Statistics, pp. 473–481.
  • S. Caldas, V. Smith, and A. Talwalkar (2018) Federated kernelized multi-task learning. The Conference on Systems and Machine Learning.
  • J. C. Duchi, A. Agarwal, and M. J. Wainwright (2011) Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606.
  • B. Gharesifard and J. Cortés (2010) When does a digraph admit a doubly stochastic adjacency matrix?. In Proceedings of the 2010 American Control Conference, pp. 2440–2445.
  • E. Hazan et al. (2016) Introduction to online convex optimization. Foundations and Trends in Optimization 2 (3-4), pp. 157–325.
  • L. He, A. Bian, and M. Jaggi (2018) COLA: decentralized linear learning. In Advances in Neural Information Processing Systems, pp. 4541–4551.
  • M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan (2014) Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 3068–3076.
  • M. Kamp, M. Boley, D. Keren, A. Schuster, and I. Sharfman (2014) Communication-efficient distributed online prediction by dynamic model synchronization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 623–639.
  • J. Konečný, B. McMahan, and D. Ramage (2015) Federated optimization: distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575.
  • J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
  • S. Lee, A. Nedić, and M. Raginsky (2016) Coordinate dual averaging for decentralized online optimization with nonseparable global objectives. IEEE Transactions on Control of Network Systems 5 (1), pp. 34–44.
  • Y. Li, M. Yu, S. Li, S. Avestimehr, N. S. Kim, and A. Schwing (2018) Pipe-SGD: a decentralized pipelined SGD framework for distributed deep net training. In Advances in Neural Information Processing Systems 31, pp. 8056–8067.
  • X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340.
  • X. Lian, W. Zhang, C. Zhang, and J. Liu (2018) Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning.
  • T. Lin, S. U. Stich, and M. Jaggi (2018) Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217.
  • C. Ma, V. Smith, M. Jaggi, M. I. Jordan, P. Richtárik, and M. Takáč (2015) Adding vs. averaging in distributed primal-dual optimization. arXiv preprint arXiv:1502.03508.
  • B. McMahan and D. Ramage (2017) Federated learning: collaborative machine learning without centralized training data. Google AI Blog.
  • H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
  • A. Nedić and A. Olshevsky (2014) Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control 60 (3), pp. 601–615.
  • A. Nedić and A. Olshevsky (2015) Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control 60 (3), pp. 601–615.
  • A. Nedić and A. Olshevsky (2016) Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control 61 (12), pp. 3936–3947.
  • S. S. Ram, A. Nedić, and V. V. Veeravalli (2010) Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications 147 (3), pp. 516–545.
  • S. Shahrampour and A. Jadbabaie (2017) Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control 63 (3), pp. 714–725.
  • S. Shalev-Shwartz et al. (2012) Online learning and online convex optimization. Foundations and Trends in Machine Learning 4 (2), pp. 107–194.
  • S. Shalev-Shwartz and T. Zhang (2013) Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research 14 (Feb), pp. 567–599.
  • Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, and H. Qian (2018) Towards more efficient stochastic decentralized learning: faster convergence and sparse communication. In Proceedings of the 35th International Conference on Machine Learning, PMLR Vol. 80, pp. 4624–4633.
  • R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321.
  • V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar (2017a) Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434.
  • V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar (2017b) Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434.
  • V. Smith, S. Forte, C. Ma, M. Takáč, M. I. Jordan, and M. Jaggi (2016) CoCoA: a general framework for communication-efficient distributed optimization. arXiv preprint arXiv:1611.02189.
  • S. U. Stich (2018) Local SGD converges fast and communicates little.
  • K. I. Tsianos, S. Lawlor, and M. G. Rabbat (2012) Push-sum distributed dual averaging for convex optimization. In 2012 IEEE 51st Conference on Decision and Control (CDC), pp. 5453–5458.
  • K. I. Tsianos and M. G. Rabbat (2012) Distributed dual averaging for convex optimization under communication delays. In 2012 American Control Conference (ACC), pp. 1067–1072.
  • P. Vanhaesebrouck, A. Bellet, and M. Tommasi (2017) Decentralized collaborative learning of personalized models over networks. In International Conference on Artificial Intelligence and Statistics (AISTATS).
  • J. Wang and G. Joshi (2018) Cooperative SGD: a unified framework for the design and analysis of communication-efficient SGD algorithms.
  • T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed (2017) Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks.
  • Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 12.
  • T. Yang, S. Zhu, R. Jin, and Y. Lin (2013) Analysis of distributed stochastic dual coordinate ascent. arXiv preprint arXiv:1312.1031.
  • T. Yang (2013) Trading computation for communication: distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 629–637.
  • H. Yu, S. Yang, and S. Zhu (2018) Parallel restarted SGD with faster convergence and less communication: demystifying why model averaging works for deep learning. arXiv preprint arXiv:1807.06629.
  • Y. Zhao, C. Yu, P. Zhao, and J. Liu (2019) Decentralized online learning: take benefits from others' data without sharing your own to track global trend. arXiv preprint arXiv:1901.10593.

Supplementary Material

Here we first present the proofs of Theorem 2 and Corollary 3, and then present some key lemmas.


7 Proof of Theorem 2 and Corollary 3

Proof.

Since the loss function is assumed to be convex, we have

For , we have

Notice that for COL, each local model coincides with the average model $\overline{x}_t$, because every node holds the same model. So for the decentralized setting, in order to bound the regret, we need to bound the difference between each local model and the average model (using Lemma 9).

Summing up the inequality above from $t = 1$ to $T$, we get

Choosing , we have

So we have

Notice that Corollary 3 can be easily verified by setting the step size as stated. ∎

Next, we present two lemmas for our proof of Lemma 9. The proofs of the following two lemmas can be found in the existing literature (Nedić and Olshevsky, 2014, 2016; Assran and Rabbat, 2018; Assran et al., 2018).

Lemma 6.

Under Assumption 1, there exists a constant such that for any , the following holds

(3)

where $W$ is a row stochastic matrix.

Lemma 7.

Under Assumption 1, for any , there always exists a stochastic vector and two constants and such that for any satisfying , the following inequality holds

where $W$ is a row stochastic matrix, and is a vector with being its -th entry.

Lemma 8.

Given two non-negative sequences and satisfying

(4)

with , we have

Proof.

From the definition, we have

(5)

Based on the above three lemmas, we can obtain the following lemma.

Lemma 9.

Under Assumption 1, the updating rule of Algorithm 1 leads to the following inequality

where is the step size, and are constants; denotes the matrix of stochastic gradients at time (e.g., its -th column is the stochastic gradient vector on node at time ).

Proof.

The updating rule of OPS can be formulated as

where $W$ is a row stochastic matrix; is a matrix each of whose columns is ; is the gradient matrix, each of whose columns is the stochastic gradient at on node ; and is the matrix each of whose columns is .

Assuming and , then we have

(6)
(7)
(8)

where is the average of all variables on the nodes, and is the averaged gradient. We have since $W$ is a row stochastic matrix.

For , according to Lemma 7, we decompose it as follows