Stochastic Distributed Optimization for Machine Learning from Decentralized Features

12/16/2018 ∙ by Yaochen Hu, et al. ∙ University of Alberta ∙ Tencent

Distributed machine learning has been widely studied in the literature to scale up machine learning model training in the presence of an ever-increasing amount of data. We study distributed machine learning from another perspective, where the information about the same training samples is inherently decentralized and located on different parties. We propose an asynchronous stochastic gradient descent (SGD) algorithm for such a feature distributed machine learning (FDML) problem, to jointly learn from decentralized features, with theoretical convergence guarantees under bounded asynchrony. Our algorithm does not require sharing the original feature data or even local model parameters between parties, thus preserving a high level of data confidentiality. We implement our algorithm for FDML in a parameter server architecture. We compare our system with fully centralized training (which violates data locality requirements) and with training based only on local features, through extensive experiments performed on a large amount of data from a real-world application, involving 5 million samples and 8,700 features in total. Experimental results demonstrate the effectiveness and efficiency of the proposed FDML system.




I Introduction

While the unprecedented success of modern machine learning models lays the foundation of many intelligent services, the performance of a sophisticated model is often limited by the availability of data. In most applications, however, a large quantity of useful data may be generated on and held by multiple parties. Collecting such data to a central site for training incurs extra management and business compliance overhead, privacy concerns, or even regulatory and judicial issues. As an alternative, a number of distributed machine learning techniques have been proposed to collaboratively train a model by letting each party perform local model updates and exchange locally computed gradients [1] or model parameters [2] with the central server to iteratively improve model accuracy. Most of the existing schemes, however, fall into the category of data-parallel computation, where the training records are located on different parties, e.g., different users hold different images to train a joint image classifier or different organizations contribute different corpora to learn a joint language model.

We study distributed machine learning from another perspective, where different features of the same record in the dataset are held by different parties, without requiring any party to share its feature set with a central server or other parties. Such scenarios arise in applications where the data collection process is inherently a collaboration among multiple devices, e.g., in mobile sensing, signals about a human user may come from multiple IoT and personal computing devices. Another scenario is that multiple organizations (e.g., different applications of the same company) may happen to have complementary information about a customer; such cross-domain knowledge from another party, if utilized, may help train a joint model and improve the prediction about the customer’s behavior and preferences.

A natural question is: how can we train a joint machine learning model if the features of each training record are located on multiple distributed parties? To make the solution as practical as possible (and conservative in terms of information sharing), we set the following goals:

  • To minimize information leakage, no party is assumed to be willing to share its feature set. Neither should any of its model parameters be communicated to other parties.

  • The joint model produced should approach the model trained in a centralized manner if all the data were collected centrally.

  • The prediction made by the joint model should outperform the prediction made by each isolated model trained only with a single party’s feature set, provided that the improvement from using the joint features exists in centralized training.

  • The system should be efficient in the presence of both large numbers of features and samples.

In this paper, we design, implement and extensively evaluate a practical Feature Distributed Machine Learning (FDML) system, with theoretical convergence guarantees, to solve the above challenges. For any supervised learning task, e.g., classification, our system enables each party to use an arbitrary model (e.g., logistic regression, factorization machine, SVM, and deep neural networks) to map its local feature set to a local prediction, while different local predictions are aggregated into a final prediction for classification via a “hyper-linear structure,” which is similar to softmax. The entire model is trained end-to-end using a mini-batched stochastic gradient descent (SGD) algorithm performed in the stale synchronous parallel (SSP) sense [3], i.e., different parties are allowed to be at different iterations of parameter updates up to a bounded delay.

A highlight of our system is that during each training iteration, every party is solely responsible for updating its own local model parameters using its own mini-batch of local feature sets, and for each record, only needs to share its local prediction to the central server (or to other parties directly in a fully decentralized scenario). Since neither the original features nor the model parameters of a party are transferred to any external sites, the FDML system is more confidentiality-friendly and much less vulnerable to model inversion attacks [4] targeting other collaborative learning algorithms [1, 5] that need to pass model parameters between parties.

We theoretically establish a convergence rate of $O(1/\sqrt{T})$ for the asynchronous FDML algorithm under certain assumptions (including the bounded delay assumption [3]), where $T$ is the number of iterations on (the slowest) party. This matches the standard convergence rate of fully centralized synchronous SGD training with a convex loss, as well as that known for asynchronously distributed data-parallel SGD in SSP [3].

We developed a distributed implementation of FDML in a parameter server architecture, and conducted experiments based on a large dataset of 5 million records and 8,700 decentralized features (extracted from different services of the same company) for a real-world app recommendation task in Tencent MyApp, a major Android app market in China. Extensive experimental results demonstrate that FDML closely approaches centralized training in terms of testing errors, even though the latter can use a more complex model, as all features are collected centrally, yet violates the data locality requirement. In the meantime, FDML significantly outperforms models trained only on a single party’s local features, demonstrating its effectiveness in harvesting insights from additional features held by another party.

II Related Work

Distributed Machine Learning. Distributed machine learning algorithms and systems have been extensively studied in recent years to scale up machine learning in the presence of big data and big models. Existing work focuses either on the theoretical convergence speed of proposed algorithms, or on the practical system aspects to reduce the overall model training time [6]. Bulk synchronous parallel (BSP) algorithms [7, 8] are among the first distributed machine learning algorithms. Due to the harsh constraints on the computation and communication procedures, these schemes share a convergence speed that is similar to traditional synchronous and centralized gradient-like algorithms. Stale synchronous parallel (SSP) algorithms [3] are a more practical alternative that abandons strict iteration barriers and allows the workers to be out of synchrony up to a certain bounded delay. Convergence results have been developed for both gradient descent and SGD [9, 3, 10] as well as proximal gradient methods [11] under different assumptions on the loss functions. In fact, SSP has become central to various types of current distributed Parameter Server architectures [12, 13, 14, 15, 16, 17].

Depending on how the computation workload is partitioned [6], distributed machine learning systems can be categorized into data parallel and model parallel systems. Most existing distributed machine learning systems [12, 13, 14, 15, 16, 17] are data parallel, where different workers hold different training samples.

Model Parallelism. There are only a couple of studies on model parallel systems, i.e., DistBelief [18] and STRADS [19], which aim to train a big model by letting each worker be responsible for updating a subset of model parameters. However, both DistBelief and STRADS require collaborating workers to transmit their local model parameters to each other (or to a server), which violates our non-leakage requirement for models and inevitably incurs more transmission overhead. Furthermore, nearly all recent advances on model parallel neural networks (e.g., DistBelief [18] and AMPNet [20]) mainly partition the network horizontally according to neural network layers, with the motivation of scaling up computation to big models. In contrast, we study a completely vertical partition strategy based strictly on features, which is motivated by the cooperation between multiple businesses/organizations that hold different aspects of information about the same samples. Another difference is that we do not require transmitting the model parameters or any raw feature data between parties.

From the theoretical perspective of model parallel algorithm analysis, [5] has proposed and analyzed the convergence of a model parallel yet non-stochastic proximal gradient algorithm that requires passing model parameters between workers under the SSP setting. Parallel coordinate descent algorithms have been analyzed recently in [21, 22]. Yet, these studies focus on randomized coordinate selection in a synchronous setting, which is different from our setting where multiple nodes can update disjoint model blocks asynchronously. Although stochastic gradient descent (SGD) is the most popular optimization method used for modern distributed data analytics and machine learning, to the best of our knowledge, there is still no convergence result for (asynchronous) SGD in a model parallel setting to date. Our convergence rate of FDML offers the first analysis of asynchronous model parallel SGD, and it matches the standard convergence rate of the original SSP algorithm [3] for data parallel SGD.

Learning Privately. A variant of distributed SGD with a filter to suppress insignificant updates has recently been applied to collaborative deep learning among multiple parties in a data parallel fashion [1]. Although raw data are not transferred by the distributed SGD in [1], a recent study [4] points out that an algorithm that passes model parameters may be vulnerable to model inversion attacks based on generative adversarial networks (GANs). In contrast, we do not let parties transfer local model parameters to the server or to any other party.

Aside from the distributed optimization approach mentioned above, another approach to privacy preserving machine learning is through feature encryption, e.g., via homomorphic encryption [23, 24]. Models are then trained on encrypted data. However, this approach cannot be flexibly generalized to all algorithms and operations, and incurs additional computation and design cost. Relatively earlier, differential privacy has also been applied to collaborative machine learning [25, 26], with an inherent tradeoff between privacy and utility of the trained model. To the best of our knowledge, none of the previous work addressed the problem of collaborative learning when the features of each training sample are distributed on multiple participants.

III Problem Formulation

Fig. 1: An illustration of the FDML model (2), where each party may adopt an arbitrary local model that is trainable via SGD. The local predictions, which only depend on the local model parameters, are aggregated into a final output using linear and nonlinear transformations (1).

Consider a system of $m$ different parties, each party holding different aspects of the same training samples. Let $\{(\xi_i^1, \xi_i^2, \ldots, \xi_i^m), y_i\}_{i=1}^n$ represent the set of $n$ training samples, where the vector $\xi_i^j \in \mathbb{R}^{d^j}$ denotes the features of the $i$-th sample located on the $j$-th party, and $y_i$ is the label of sample $i$. Let $\xi_i \in \mathbb{R}^d$ be the overall feature vector of sample $i$, which is a concatenation of the vectors $\xi_i^1, \ldots, \xi_i^m$, with $d = \sum_{j=1}^m d^j$. Suppose the parties are not allowed to transfer their respective feature vectors to each other, for the regulatory and privacy reasons mentioned above. In our problem, the feature vectors on two parties may or may not contain overlapping features. The goal of machine learning is to find a model $p(x, \xi)$ with parameters $x$ that, given an input $\xi$, can predict its label $y$, by minimizing the loss between the model prediction $p(x, \xi_i)$ and its corresponding label $y_i$ over all training samples $i$.

We propose a Feature Distributed Machine Learning (FDML) algorithm that can train a joint model by utilizing all the distributed features while keeping the raw features at each party unrevealed to other parties. To achieve this goal, we adopt a specific class of model that has the form

$$p(x, \xi) = \sigma\Big(\sum_{j=1}^m a_j \, \alpha^j(x^j, \xi^j)\Big), \qquad (1)$$

where $\alpha^j(x^j, \xi^j)$, $j = 1, \ldots, m$, is a sub-model on party $j$ with parameters $x^j \in \mathbb{R}^{D^j}$, which can be a general function that maps the local features $\xi^j$ on each party to a local prediction. In addition, $\sigma$ is a continuously differentiable function that aggregates the local intermediate predictions $\alpha^j(x^j, \xi^j)$ weighted by $a_j$. Note that $x \in \mathbb{R}^D$, with $D = \sum_{j=1}^m D^j$, is a concatenation of the local model parameters $x^j$ over all parties $j$.

As illustrated by Fig. 1, the model adopted here is essentially a composite model, where each sub-model $\alpha^j$ on party $j$ with parameters $x^j$ could be an arbitrary model, e.g., logistic regression, SVM, deep neural networks, factorization machines, etc. Each sub-model on party $j$ is only concerned with the local features $\xi^j$. The final prediction is made by merging the local intermediate results through a linear transformation followed by a nonlinear one, e.g., a softmax function. Note that in (1), all weights $a_j$ can be absorbed into the sub-models $\alpha^j(x^j, \xi^j)$ by rescaling some of their parameters. Without loss of generality, we simplify the model to the following:

$$p(x, \xi) = \sigma\Big(\sum_{j=1}^m \alpha^j(x^j, \xi^j)\Big). \qquad (2)$$

In this model, both the local features $\xi^j$ and the sub-model parameters $x^j$ are stored and processed locally within party $j$, while only the local predictions $\alpha^j(x^j, \xi^j)$ need to be shared to produce the final prediction. Therefore, the raw features as well as all sub-model parameters are kept private. In Sec. IV, we propose an asynchronous SGD algorithm that also preserves the non-sharing properties for all the local features as well as all sub-model parameters even during the model training phase, with theoretical convergence guarantees.
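To make the composite structure concrete, here is a minimal Python sketch, assuming each party uses a linear sub-model and the aggregation function is a sigmoid. The class `LinearParty` and the function `joint_prediction` are hypothetical names introduced only for illustration; they are not part of the FDML system.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

class LinearParty:
    """Hypothetical party whose sub-model is a linear score over its local features."""

    def __init__(self, weights):
        # Local parameters: never leave the party.
        self.weights = np.asarray(weights, dtype=float)

    def local_prediction(self, features):
        # Scalar local prediction: the only value the party shares.
        return float(self.weights @ np.asarray(features, dtype=float))

def joint_prediction(local_predictions):
    """Aggregate the shared scalars: sum them, then apply the sigmoid."""
    return sigmoid(sum(local_predictions))

# Two parties holding different features of the same sample.
parties = [LinearParty([0.5, -1.0]), LinearParty([2.0])]
features = [[1.0, 0.5], [0.25]]

shared = [p.local_prediction(f) for p, f in zip(parties, features)]
p_joint = joint_prediction(shared)
```

Only the list `shared` crosses party boundaries in this sketch; the feature vectors and the weights stay inside each `LinearParty` object.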

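In general, the model is trained by solving the following problem:

$$\min_{x} \ \frac{1}{n}\sum_{i=1}^n L(x; \xi_i, y_i) + \lambda \sum_{j=1}^m z^j(x^j), \qquad (3)$$

where $L(x; \xi_i, y_i)$ is the loss function, indicating the gap between the predicted value $p(x, \xi_i)$ and the true label $y_i$ for each sample, and $z^j(x^j)$ is the regularizer for sub-model $\alpha^j$.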

IV Asynchronous SGD Algorithm for FDML

In this section, we describe our asynchronous and distributed stochastic gradient descent (SGD) algorithm specifically designed to solve the optimization problem (3) in FDML, with theoretical convergence guarantees.

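Since we consider a stochastic algorithm, let $i(t)$ be the index of the sample $\xi_{i(t)}$ presented to the training algorithm in iteration $t$. To simplify notation, we denote the regularized loss of sample $i(t)$ by

$$F_t(x) := L(x; \xi_{i(t)}, y_{i(t)}) + \lambda \sum_{j=1}^m z^j(x^j). \qquad (4)$$

Thus, in stochastic optimization, minimizing the loss in (3) over the entire training set is equivalent to solving the following problem [3]:

$$\min_{x} \ F(x) := \frac{1}{T}\sum_{t=1}^T F_t(x), \qquad (5)$$

where $T$ is the total number of iterations. Let $\nabla F(x) \in \mathbb{R}^D$ be the gradient of $F$. Let $\nabla^j F(x) \in \mathbb{R}^{D^j}$ be the partial gradient of $F$ with respect to the sub-model parameters $x^j$, i.e., $\nabla^j F(x) := \partial F(x) / \partial x^j$. Clearly, $\nabla F(x)$ is the concatenation of all the partial gradients $\nabla^1 F(x), \ldots, \nabla^m F(x)$.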

IV-A The Synchronous Algorithm

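In a synchronous setting, we can simply parallelize an SGD algorithm by updating each parameter block $x^j$ concurrently for all $j = 1, \ldots, m$, given an incoming sample $i(t)$, i.e.,

$$x^j_{t+1} = x^j_t - \eta_t \nabla^j F_t(x^1_t, \ldots, x^m_t), \qquad (6)$$

where $\eta_t$ is a predefined learning rate scheme. Specifically for model (2), according to (4), we can obtain the partial gradient $\nabla^j F_t(x)$ for $j = 1, \ldots, m$ as

$$\nabla^j F_t(x) = L'\Big(\sigma\Big(\sum_{k=1}^m \alpha^k(x^k, \xi^k_{i(t)})\Big)\Big)\, \sigma'\Big(\sum_{k=1}^m \alpha^k(x^k, \xi^k_{i(t)})\Big)\, \frac{\partial \alpha^j(x^j, \xi^j_{i(t)})}{\partial x^j} + \lambda \frac{\partial z^j(x^j)}{\partial x^j} := H\Big(\sum_{k=1}^m \alpha^k(x^k, \xi^k_{i(t)})\Big)\, \frac{\partial \alpha^j(x^j, \xi^j_{i(t)})}{\partial x^j} + \lambda \frac{\partial z^j(x^j)}{\partial x^j}, \qquad (7)$$

where we simplify the notation of the first few terms related to $\sum_{k=1}^m \alpha^k(x^k, \xi^k_{i(t)})$ by a function $H(\cdot)$. In practice, $z^j$ could be non-smooth. This setting is usually handled by proximal methods. In this work, we focus only on the smooth case.

This indicates that for the class of models in (2) adopted by FDML, each party $j$ does not even need other parties’ models $x^k$, where $k \neq j$, to compute its partial gradient $\nabla^j F_t$. Instead, to compute $\nabla^j F_t$ in (7), each party only needs one term, $\sum_{k=1}^m \alpha^k(x^k, \xi^k_{i(t)})$, which is the aggregation of the local prediction results from all parties at iteration $t$, while the remaining terms in (7) are only concerned with party $j$’s local model $x^j$ and local features $\xi^j_{i(t)}$. Therefore, this specific property enables a parallel algorithm with minimum sharing among parties, where neither local features nor local model parameters need to be passed among parties.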
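The following sketch checks this property numerically in one special case. The assumptions are illustrative rather than taken from the paper: each sub-model is linear, the loss is the logistic (log) loss with a sigmoid output, and the regularizer is $\frac{1}{2}\|x^j\|^2$. Under these assumptions the derivative of the loss with respect to the aggregated prediction is `sigmoid(aggregated) - y`. The function name `partial_gradient` is hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def partial_gradient(x_j, xi_j, y, aggregated, lam):
    """Party j's partial gradient for a linear sub-model (assumed setup).

    `aggregated` is the sum of all parties' local predictions, the only
    quantity party j needs from outside its own data and parameters.
    """
    h = sigmoid(aggregated) - y      # derivative of the log loss w.r.t. the aggregated score
    return h * xi_j + lam * x_j      # chain rule through x_j . xi_j, plus the L2 regularizer term

rng = np.random.default_rng(0)
dims = [3, 2, 4]                                  # feature dimensions of three parties
xs = [rng.normal(size=d) for d in dims]           # local parameters
xis = [rng.normal(size=d) for d in dims]          # local features of one sample
y, lam = 1.0, 0.01

aggregated = sum(float(x @ xi) for x, xi in zip(xs, xis))
block_grads = [partial_gradient(x, xi, y, aggregated, lam) for x, xi in zip(xs, xis)]

# Reference: gradient of the same model trained centrally on concatenated features.
x_all, xi_all = np.concatenate(xs), np.concatenate(xis)
central_grad = (sigmoid(x_all @ xi_all) - y) * xi_all + lam * x_all
```

In this setup, concatenating `block_grads` reproduces `central_grad`, even though each call to `partial_gradient` sees only one party's parameters and features plus the scalar `aggregated`.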

IV-B The Asynchronous Algorithm

However, the asynchronous implementation of this idea in a distributed setting of multiple parties, with theoretical convergence guarantees, is significantly more challenging than it seems. As our proposed algorithm is closely related to asynchronous SGD, yet extends it from the data-parallel setting [3] to a block-wise model parallel setting, we would call our algorithm Asynchronous SGD for FDML.

Note that in an asynchronous setting, each party updates its own parameters asynchronously, and two parties may be in different iterations. However, we assume that different parties go through the samples in the same order, although asynchronously, i.e., all the parties share the randomly generated sample index sequence $i(t)$, $t = 1, \ldots, T$, which can easily be realized by sharing the seed of a pseudo-random number generator.
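A minimal sketch of the shared-seed idea, using Python's standard `random` module. The function name `sample_schedule` and the seed value are hypothetical; the sketch only assumes that all parties run the same generator implementation.

```python
import random

def sample_schedule(seed, num_samples, num_iterations):
    """Generate the sample index sequence from a shared seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.randrange(num_samples) for _ in range(num_iterations)]

# Each party calls the function locally with the agreed seed.
schedule_party_1 = sample_schedule(seed=2018, num_samples=1000, num_iterations=50)
schedule_party_2 = sample_schedule(seed=2018, num_samples=1000, num_iterations=50)
```

Both parties obtain the same sequence without exchanging anything beyond the seed.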

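When each party $j$ has its own iteration $t_j$, the local model parameters $x^j$ on party $j$ are updated by

$$x^j_{t_j+1} = x^j_{t_j} - \eta_{t_j} \nabla^j F_{t_j}\Big(x^1_{t_j - \tau^j_{t_j}(1)}, \ldots, x^m_{t_j - \tau^j_{t_j}(m)}\Big) \qquad (8)$$

$$\phantom{x^j_{t_j+1}} = x^j_{t_j} - \eta_{t_j} \bigg( H\Big(\sum_{k=1}^m \alpha^k\big(x^k_{t_j - \tau^j_{t_j}(k)}, \xi^k_{i(t_j)}\big)\Big)\, \frac{\partial \alpha^j(x^j_{t_j}, \xi^j_{i(t_j)})}{\partial x^j} + \lambda \frac{\partial z^j(x^j_{t_j})}{\partial x^j} \bigg), \qquad (9)$$

where the requested aggregation of local predictions for sample $i(t_j)$ may be computed from possibly stale versions of model parameters, $x^k_{t_j - \tau^j_{t_j}(k)}$, on other parties $k \neq j$, where $\tau^j_{t_j}(k)$ represents how many iterations of lag there are from party $k$ to party $j$ at the $t_j$-th iteration of party $j$ (with $\tau^j_{t_j}(j) = 0$). In other words, at the $t_j$-th iteration of party $j$, the latest local model on another party $k$ was updated $\tau^j_{t_j}(k)$ iterations ago. We give a convergence speed guarantee for the proposed algorithm under certain assumptions, when the lag $\tau^j_{t_j}(k)$ is bounded.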

V Distributed Implementation

We describe a distributed implementation of the proposed asynchronous SGD algorithm for FDML. Our implementation is inspired by the Parameter Server architecture [12, 11, 13]. In a typical Parameter Server system, the clients compute gradients while the server updates the model parameters with the gradients computed by clients. Yet, in our implementation, as described in Algorithm 1, the only job of the server is to maintain and update a matrix $A = [A_{i,j}]$, $i = 1, \ldots, n$, $j = 1, \ldots, m$, which is introduced to hold the latest $m$ local predictions for each sample $i$. We call $A$ the local prediction matrix. On the other hand, unlike in a typical Parameter Server, the workers in our system, each representing a participating party, do not only compute gradients; they also update their respective local model parameters with SGD.

Furthermore, since each worker performs local updates individually, each worker can further employ a parameter server cluster or a shared-memory system, e.g., a CPU/GPU cluster, to scale up and parallelize the computation workload related to the arbitrary local model (e.g., a DNN or FM). A similar hierarchical cluster is considered in Gaia [17] for data-parallel machine learning among multiple data centers.

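Algorithm 1 A Distributed Implementation of FDML

Require: each worker $j$ holds the local feature set $\{\xi_i^j, y_i\}_{i=1}^n$, $j = 1, \ldots, m$; a sample presentation schedule $i(t)$, $t = 1, \ldots, T$, is pre-generated randomly and shared among workers.
Output: model parameters $x_T = (x_T^1, \ldots, x_T^m)$.

  Server:
  Initialize the local prediction matrix $[A_{i,j}]_{n \times m}$.
  while True do
           if Pull request (worker: $j$, iteration: $t$) received then
                    if $t$ is not $\tau$ iterations ahead of the slowest worker then
                             Send $\sum_{k=1}^m A_{i(t),k}$ to Worker $j$
                    else
                             Reject the Pull request
                    end if
           end if
           if Push request (worker: $j$, iteration: $t$, value: $c$) received then
                    $A_{i(t),j} := c$
           end if
  end while

  Worker $j$ ($j = 1, \ldots, m$) asynchronously performs:
  for $t = 1, \ldots, T$ do
           while Pull not successful do
                    Pull $\sum_{k=1}^m A_{i(t),k}$ from Server
           end while
           Update $x^j_t$ into $x^j_{t+1}$ by (9), using the pulled aggregation
           Push $c := \alpha^j(x^j_{t+1}, \xi^j_{i(t)})$ to Server.
  end for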

First, we describe how the input data should be prepared for the FDML system. Before the training task, for consistency and efficiency, a sample coordinator will first randomly shuffle the sample indices and generate the sample presentation schedule $i(t)$, which dictates the order in which samples should be presented to the training algorithm. However, since the features of the same sample are located on multiple parties, we need to find all the local features $\xi_i^j$ as well as the label $y_i$ associated with sample $i$. This can be done by using some common identifiers that are present in all local features of a sample, like user IDs, phone numbers, date of birth plus name, item IDs, etc. Finally, the labels $y_i$ will be sent to all workers (parties) so that they can compute error gradients locally. Therefore, before the algorithm starts, each worker $j$ holds a local dataset $\{\xi_i^j, y_i\}_{i=1}^n$, for all $j = 1, \ldots, m$.
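The alignment step can be sketched as follows, assuming each party's records are keyed by a common identifier such as a user ID. The function `align_by_id` and the toy tables are hypothetical, and the sketch runs in a single process purely for illustration; in a real deployment only the identifiers and labels, not the feature values, would be exchanged.

```python
def align_by_id(party_tables, labels):
    """Keep only samples whose identifier appears at every party and has a label.

    party_tables: one dict per party, mapping identifier -> local features.
    labels: dict mapping identifier -> label.
    Returns the common identifiers in a fixed order, plus one local dataset
    (list of (features, label) pairs) per party in that same order.
    """
    common = set(labels)
    for table in party_tables:
        common &= set(table)
    ordered_ids = sorted(common)  # any order works, as long as all parties agree on it
    local_datasets = [
        [(table[i], labels[i]) for i in ordered_ids] for table in party_tables
    ]
    return ordered_ids, local_datasets

# Toy data: two parties with partially overlapping users.
store = {"u1": [1, 0], "u2": [0, 1], "u3": [1, 1]}
browser = {"u2": [0.3], "u3": [0.9], "u4": [0.1]}
labels = {"u1": 1, "u2": 0, "u3": 1}

ids, datasets = align_by_id([store, browser], labels)
```

Row `k` of every local dataset then refers to the same sample, so a shared sample index is meaningful across parties.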

Let us explain Algorithm 1 from a worker’s perspective.

To solve for $x$ collaboratively, each worker $j$ goes through the iterations $t = 1, \ldots, T$ individually and asynchronously in parallel, according to the (same) predefined sample presentation schedule $i(t)$, and updates its local model $x^j$ according to (9). In a particular iteration $t$, when worker $j$ updates $x^j_t$ with the current local features $\xi^j_{i(t)}$, it needs to use the latest aggregated local prediction pulled from the server, which is given by $\sum_{k=1}^m A_{i(t),k}$ based on the latest versions of local predictions, $A_{i(t),k}$, maintained on the server for all the workers $k = 1, \ldots, m$. After $x^j_t$ is updated into $x^j_{t+1}$ locally by (9), worker $j$ needs to send its updated local prediction about sample $i(t)$ to the server in order to update $A_{i(t),j}$, i.e., $A_{i(t),j} := \alpha^j(x^j_{t+1}, \xi^j_{i(t)})$. This update is done through the value $c$ uploaded to the server in a Push request from worker $j$ with iteration index $t$ and value $c$.

Since the workers perform local model updates asynchronously, at a certain point, different workers might be in different iterations, and a faster worker may be using stale local predictions from other workers. We adopt a stale synchronous protocol to strike a balance between the evaluation time for each iteration and the total number of iterations to converge: a fully synchronous algorithm takes the fewest iterations to converge yet incurs a large waiting time per iteration due to straggler workers, whereas an asynchronous algorithm reduces the per-iteration evaluation time at the possible cost of more iterations to converge. In order to reduce the overall training time, we require that the iteration of the fastest party should not exceed the iteration of the slowest party by $\tau$, i.e., the server will reject a Pull request if the $t$ from the Pull request (worker: $j$, iteration: $t$) is $\tau$ iterations ahead of the slowest worker in the system. A similar bounded delay condition is enforced in most Parameter-Server-like systems [12, 13, 14, 15, 16, 17] to ensure convergence and avoid the chaotic behavior of a completely asynchronous system.
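The server-side logic of the bounded-delay rule can be sketched as a small in-process Python class. This is an illustration under stated assumptions, not the authors' implementation: the class name `PredictionServer`, the per-worker `clock` of completed iterations, and the exact rejection test (reject when the requested iteration is more than `staleness` ahead of the slowest worker's current iteration, so that `staleness = 0` behaves synchronously) are choices made for this sketch.

```python
class PredictionServer:
    """Holds the local prediction matrix and enforces a bounded delay (sketch)."""

    def __init__(self, num_samples, num_workers, staleness):
        self.A = [[0.0] * num_workers for _ in range(num_samples)]
        self.clock = [0] * num_workers   # iterations completed (pushed) by each worker
        self.staleness = staleness

    def pull(self, worker, iteration, sample):
        """Return the aggregated prediction, or None if the request is rejected."""
        slowest_iteration = min(self.clock) + 1
        if iteration - slowest_iteration > self.staleness:
            return None                   # worker is too far ahead: must retry later
        return sum(self.A[sample])

    def push(self, worker, iteration, sample, value):
        """Store the worker's latest local prediction for this sample."""
        self.A[sample][worker] = value
        self.clock[worker] = iteration

server = PredictionServer(num_samples=4, num_workers=2, staleness=1)

first = server.pull(worker=0, iteration=1, sample=2)       # accepted
server.push(worker=0, iteration=1, sample=2, value=0.7)
second = server.pull(worker=0, iteration=2, sample=2)      # one ahead: accepted
server.push(worker=0, iteration=2, sample=2, value=0.9)
rejected = server.pull(worker=0, iteration=3, sample=2)    # two ahead: rejected
server.push(worker=1, iteration=1, sample=2, value=-0.4)   # slow worker catches up
retry = server.pull(worker=0, iteration=3, sample=2)       # accepted again
```

A worker that receives `None` keeps retrying its Pull, which mirrors the inner `while` loop of Algorithm 1.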

The matrix $A$ that holds local prediction results can be initialized as $A_{i,j} = \alpha^j(x^j_0, \xi^j_i)$, with model parameters $x^j_0$ initialized randomly. If the total number of epochs is small, where an epoch is defined as one complete presentation of the entire dataset to the training process, we perform a synchronous algorithm (e.g., by setting $\tau$ to zero or a very small value) in the first epoch to obtain relatively reliable initial values for $A$.

In real applications, the SGD algorithm can easily be replaced with mini-batched SGD, by replacing the sample presentation schedule $i(t)$ with a set $I(t)$ representing the indices of a mini-batch of samples to be used in iteration $t$, and replacing the partial gradient in (8) with the sum of partial gradients over the mini-batch $I(t)$.
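As a sketch of the mini-batched variant, the function below sums per-sample partial gradients over a batch in one vectorized step. It assumes, for illustration only, a linear sub-model, the logistic loss with a sigmoid output, and an L2 regularizer $\frac{1}{2}\|x^j\|^2$; the name `minibatch_partial_gradient` is hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def minibatch_partial_gradient(x_j, Xi_j, y, aggregated, lam):
    """Sum of per-sample partial gradients over one mini-batch (assumed setup).

    Xi_j: (batch, d_j) local features; y: (batch,) labels;
    aggregated: (batch,) aggregated predictions pulled for the batch.
    """
    h = sigmoid(aggregated) - y                  # one loss derivative per sample
    # Each sample contributes lam * x_j, hence the factor len(y).
    return Xi_j.T @ h + len(y) * lam * x_j

rng = np.random.default_rng(1)
batch, d_j, lam = 5, 3, 0.01
x_j = rng.normal(size=d_j)
Xi_j = rng.normal(size=(batch, d_j))
y = rng.integers(0, 2, size=batch).astype(float)
aggregated = rng.normal(size=batch)              # stands in for values pulled from the server

batch_grad = minibatch_partial_gradient(x_j, Xi_j, y, aggregated, lam)

# Reference: explicit loop over the samples in the batch.
loop_grad = sum(
    (sigmoid(aggregated[i]) - y[i]) * Xi_j[i] + lam * x_j for i in range(batch)
)
```

The worker would pull one aggregated prediction per sample in the batch and push one updated local prediction per sample afterwards.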

Finally, it is worth noting that the implementation in Algorithm 1 can be replaced by a completely peer-to-peer version without a server, where each party simply broadcasts its updated local prediction result for each sample to other parties.

VI Convergence Analysis

Inspired by a series of studies [27, 3, 17] on the convergence behavior of convex objective functions, we analyze the convergence property of the proposed asynchronous algorithm by evaluating a regret function $R$, which is the difference between the aggregated training loss and the loss of the optimal solution, i.e., the regret is defined as

$$R := \frac{1}{T}\sum_{t=1}^T F_t(x_t) - F(x^*),$$

where $x^*$ is the optimal solution for $F(x)$, such that $x^* = \arg\min_x F(x)$. Note that during training, the same set of data will be looped through for several epochs. This is as if a very large dataset is gone through until the $T$-th iteration. We will prove convergence by showing that $R$ decreases to $0$ as $T$ grows. Before presenting the main result, we introduce several notations and assumptions. We use $D(x)$ to denote the distance measure from $x$ to $x^*$, i.e., $D(x) := \frac{1}{2}\|x - x^*\|^2$.

We make the following common assumptions on the loss function, which are used in many related studies as well.

Assumption 1
  1. The function is differentiable and the partial gradient are blockwise Lipschitz continuous with , namely,


    for . We denote as the maximum among the for .

  2. Convexity of the loss function .

  3. Bounded solution space. There exists a , s.t., for .

As a consequence of the assumptions, the gradients are bounded, i.e., , s.t., for

With these assumptions, we come to our main result on the convergence rate of the proposed SGD algorithm.

Proposition 1

Under the assumptions in Assumption 1, with a learning rate of $\eta_t = \eta/\sqrt{t}$ and a bounded staleness of $\tau$, the regret $R$ given by the updates (8) for the FDML problem is $R = O(1/\sqrt{T})$.

Proof. Please refer to Appendix for the proof.
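To give a feel for what a decaying average regret looks like, here is a toy Python sketch. It is not the FDML algorithm: it runs plain centralized SGD on the scalar losses $F_t(x) = \frac{1}{2}(x - c_t)^2$ with alternating targets $c_t = \pm 1$, using a step size of the form $\eta/\sqrt{t}$. The step-size form, the toy losses, and the function names are assumptions chosen for illustration.

```python
import math

def learning_rate(eta, t):
    """Decaying step size eta / sqrt(t)."""
    return eta / math.sqrt(t)

def average_regret(num_iterations, eta=1.0):
    """Average regret of plain SGD on F_t(x) = 0.5 * (x - c_t)^2 with c_t = +1, -1, ..."""
    targets = [1.0 if t % 2 else -1.0 for t in range(1, num_iterations + 1)]
    x, total_loss = 0.0, 0.0
    for t, c in enumerate(targets, start=1):
        total_loss += 0.5 * (x - c) ** 2           # loss of the current iterate
        x -= learning_rate(eta, t) * (x - c)       # SGD step on F_t
    x_star = sum(targets) / num_iterations         # minimizer of the average loss
    optimal = sum(0.5 * (x_star - c) ** 2 for c in targets) / num_iterations
    return total_loss / num_iterations - optimal

regret_long = average_regret(10_000)
```

For this toy problem the standard online gradient descent bound guarantees that `regret_long` stays below 0.1, and the bound shrinks further as the number of iterations grows.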

VII Experiments

Fig. 2: A comparison between the three model training schemes for the LR model. Panels: (a) training objective vs. epoch; (b) testing log loss vs. epoch; (c) testing AUC vs. epoch; (d) training objective vs. time. All curves are plotted for epochs 1–40, including the time curve in (d).

Fig. 3: A comparison between the three model training schemes for the NN model. Panels: (a) training objective vs. epoch; (b) testing log loss vs. epoch; (c) testing AUC vs. epoch; (d) training objective vs. time. All curves are plotted for epochs 1–40, including the time curve in (d).

We evaluate the proposed FDML system on a realistic app recommendation task from Tencent MyApp, which is a major Android market with an extremely large body of users. In this task, user features, including past download activities in the Android store, are recorded. In the meantime, the task can also benefit from other features about the same users logged in two other services (run by different departments of the same company), including their browsing history in the QQ web browser app, which tracks their interests in different types of content, as well as their app invoking and usage history recorded by an Android security app named WeSecure. The goal here is to leverage the additional user features available from the other domains to improve the app recommendation in the Android store in question, yet without downloading the raw user features from other departments due to regulatory and privacy issues. One reason is that customer data in different departments are protected under different security levels and even under different agreements. Some sensitive features under strong protection are prohibited from being moved to other parties, including other departments.

The dataset we use contains 5 million labeled samples indicating whether a user will download an app or not. Each sample is a user-app pair, which contains around 8,700 (sparse) features in total, among which one part comes from the Android app store itself, while the remaining features are from the other two departments. We run both a logistic regression (LR) and a two-layer fully connected neural network (NN) under three different training schemes:

  • Local: only use the local features from the Android app store itself to train a model.

  • Centralized: collect all the features from all three departments to a central server (violating the data locality requirement) and train the model using the standard mini-batched SGD.

  • FDML: use FDML system to train a joint model for app recommendation based on all features located in three different departments as is, without centrally collecting data.

For FDML, there is a single server with three workers, each of which is equipped with an Intel Xeon CPU E5-2670 v3 @ 2.30GHz. Each worker handles the features from one of the three departments. The system will be asynchronous as the lengths of features handled by each worker are different. The FDML NN only considers a fully connected NN within each party while merging the three local predictions in a composite model, whereas the Centralized NN uses a fully connected neural network over all the features, thus leading to a more complex model (with interactions between the local features of different departments) than FDML NN.

We randomly shuffle the data and split it into a 4.5 million training set and a 0.5 million testing set. For all training schemes, mini-batched SGD is used with a batch size of 100, so each epoch consists of 45,000 batches of updates. For each epoch, we keep track of the optimization objective value on the training data, the log loss and the AUC on the testing data, as well as the elapsed time of the epoch. Fig. 2 and Fig. 3 present the major statistics of the models during the training procedure for LR and NN, respectively. Table I presents the detailed statistics at the end of epoch 10, when all the algorithms yield a stable and good performance on the testing data. The results show that FDML outperforms the corresponding Local scheme with only local features, and even approaches the performance of the Centralized scheme, while keeping the feature sets local to their respective workers.

For LR, as shown by Fig. 2 and Table I, Centralized LR and FDML LR both achieve a smaller training objective value as well as significantly better performance on the testing set than Local LR. As expected, additional features recorded by other related services do help improve the app recommendation performance. Furthermore, Centralized LR and FDML LR have very close performance, since these two methods use essentially the same model for LR, though with different training algorithms.

For NN, as shown in Fig. 3 and Table I, by leveraging additional features, both FDML NN and Centralized NN substantially outperform Local NN. Meanwhile, Centralized NN is slightly better than FDML NN in testing AUC (0.7284 vs. 0.7203 in Table I), with nearly identical testing log loss, since Centralized NN has essentially adopted a more complex model, enabling feature interaction between different parties directly through fully connected neural networks.

Fig. 2(d) and Fig. 3(d) compare the training time and speed among the three learning schemes. Unsurprisingly, for both the LR and NN models, the Local scheme is the fastest since it uses the smallest number of features and has no communication or synchronization overhead. For LR in Fig. 2(d), FDML LR is slower than Centralized LR since the computation load is relatively small in this LR model and thus the communication overhead dominates. In contrast, for NN, as shown in Fig. 3(d), Centralized NN is slower than FDML NN. This is because Centralized NN has many more inner connections and hence many more model parameters to train. Another reason is that FDML distributes the heavy computation load in this NN scenario to three different workers, which in fact speeds up training.

Algorithm        Train objective   Test log loss   Test AUC   Time (s)
LR Local         0.1183            0.1220          0.6573     546
LR Centralized   0.1159            0.1187          0.7037     1063
LR FDML          0.1143            0.1191          0.6971     3530
NN Local         0.1130            0.1193          0.6830     784
NN Centralized   0.1083            0.1170          0.7284     8051
NN FDML          0.1101            0.1167          0.7203     4369
TABLE I: The performance of different algorithms.
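The comparisons drawn from Table I can be made concrete with a short Python sketch. The numbers below are copied from Table I; the helper names `auc_gain_recovered` and `time_ratio` are hypothetical and serve only to illustrate the arithmetic.

```python
# (train objective, test log loss, test AUC, time in seconds), copied from Table I.
table = {
    ("LR", "Local"): (0.1183, 0.1220, 0.6573, 546),
    ("LR", "Centralized"): (0.1159, 0.1187, 0.7037, 1063),
    ("LR", "FDML"): (0.1143, 0.1191, 0.6971, 3530),
    ("NN", "Local"): (0.1130, 0.1193, 0.6830, 784),
    ("NN", "Centralized"): (0.1083, 0.1170, 0.7284, 8051),
    ("NN", "FDML"): (0.1101, 0.1167, 0.7203, 4369),
}

AUC, TIME = 2, 3  # column indices into the tuples above


def auc_gain_recovered(model):
    """Share of the Centralized-over-Local AUC gain that FDML also achieves."""
    local = table[(model, "Local")][AUC]
    fdml_gain = table[(model, "FDML")][AUC] - local
    central_gain = table[(model, "Centralized")][AUC] - local
    return fdml_gain / central_gain


def time_ratio(model):
    """FDML training time divided by Centralized training time."""
    return table[(model, "FDML")][TIME] / table[(model, "Centralized")][TIME]


for model in ("LR", "NN"):
    print(model, round(auc_gain_recovered(model), 2), round(time_ratio(model), 2))
```

By this measure, FDML recovers roughly 86% (LR) and 82% (NN) of the test AUC improvement that Centralized training achieves over Local training. The time ratio is about 3.3 for LR, where FDML is slower, and about 0.54 for NN, where FDML is faster.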

VIII Conclusions

We study a feature distributed machine learning (FDML) problem motivated by real-world industrial applications, where the features of the same training sample are inherently decentralized and held by multiple parties. This setting contrasts with most existing literature on collaborative learning, which assumes that the data samples (but not the features) are distributed. We propose an asynchronous SGD algorithm to solve the FDML problem, with a convergence rate of $O(1/\sqrt{T})$, $T$ being the total number of iterations, matching the convergence rate known for data-parallel SGD in a stale synchronous parallel setting [3]. We have developed a distributed implementation of the FDML system in a parameter server architecture and performed an extensive evaluation on a large dataset of 5 million records and 8700 features for a realistic app recommendation task. Results show that FDML can closely approximate centralized training (the latter collecting all data centrally and using a more complex model) in terms of the testing AUC and log loss, while significantly outperforming models trained only on a single party’s local features.


References

  • [1] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. ACM, 2015, pp. 1310–1321.
  • [2] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., “Communication-efficient learning of deep networks from decentralized data,” arXiv preprint arXiv:1602.05629, 2016.
  • [3] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in Advances in neural information processing systems, 2013, pp. 1223–1231.
  • [4] B. Hitaj, G. Ateniese, and F. Perez-Cruz, “Deep models under the gan: information leakage from collaborative deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 603–618.
  • [5] Y. Zhou, Y. Yu, W. Dai, Y. Liang, and E. Xing, “On convergence of model parallel proximal gradient algorithm for stale synchronous parallel system,” in Artificial Intelligence and Statistics, 2016, pp. 713–722.
  • [6] E. P. Xing, Q. Ho, P. Xie, and D. Wei, “Strategies and principles of distributed machine learning on big data,” Engineering, vol. 2, no. 2, pp. 179–195, 2016.
  • [7] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimal distributed online prediction using mini-batches,” Journal of Machine Learning Research, vol. 13, no. Jan, pp. 165–202, 2012.
  • [8] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized stochastic gradient descent,” in Advances in neural information processing systems, 2010, pp. 2595–2603.
  • [9] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in neural information processing systems, 2011, pp. 693–701.
  • [10] X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 2737–2745.
  • [11] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” in Advances in Neural Information Processing Systems, 2014, pp. 19–27.
  • [12] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in OSDI, vol. 14, 2014, pp. 583–598.
  • [13] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project adam: Building an efficient and scalable deep learning training system,” in OSDI, vol. 14, 2014, pp. 571–582.
  • [14] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
  • [15] M. Li, Z. Liu, A. J. Smola, and Y.-X. Wang, “Difacto: Distributed factorization machines,” in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 2016, pp. 377–386.
  • [16] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A System for Large-Scale Machine Learning,” in Proc. USENIX Symposium on Operating System Design and Implementation (OSDI), 2016.
  • [17] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu, “Gaia: Geo-distributed machine learning approaching lan speeds,” in NSDI, 2017, pp. 629–647.
  • [18] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, 2012, pp. 1223–1231.
  • [19] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing, “On model parallelization and scheduling strategies for distributed machine learning,” in Advances in neural information processing systems, 2014, pp. 2834–2842.
  • [20] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis,” arXiv preprint arXiv:1802.09941, 2018.
  • [21] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, “Parallel coordinate descent for l1-regularized loss minimization,” arXiv preprint arXiv:1105.5379, 2011.
  • [22] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin, “Feature clustering for accelerating parallel coordinate descent,” in Advances in Neural Information Processing Systems, 2012, pp. 28–36.
  • [23] R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing, “Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy,” in International Conference on Machine Learning, 2016, pp. 201–210.
  • [24] H. Takabi, E. Hesamifard, and M. Ghasemi, “Privacy preserving multi-party machine learning with homomorphic encryption,” in 29th Annual Conference on Neural Information Processing Systems (NIPS), 2016.
  • [25] M. Pathak, S. Rane, and B. Raj, “Multiparty differential privacy via aggregation of locally trained classifiers,” in Advances in Neural Information Processing Systems, 2010, pp. 1876–1884.
  • [26] A. Rajkumar and S. Agarwal, “A differentially private stochastic gradient descent algorithm for multiparty classification,” in Artificial Intelligence and Statistics, 2012, pp. 933–941.
  • [27] J. Langford, A. J. Smola, and M. Zinkevich, “Slow learners are fast,” Advances in Neural Information Processing Systems, vol. 22, pp. 2331–2339, 2009.

IX Appendix

Proof of Proposition 1. By the proposed algorithm and from (8), we have


where is the concatenated model parameters with staleness in which . Note that we always have . To help prove the proposition, we first prove a lemma.

Lemma 1

Dividing the above equation by and rearranging it, we can get the lemma.

Another important fact for our analysis is


We now evaluate the regret up to iteration $T$. By the definition in (10), we have


where (19) follows from the convexity of the loss functions. Substituting the result of Lemma 1, we get


We look into the three terms of (20) and bound them.

For the first term, we have


where (23) comes from the fact in (15). For the second term, we have


Finally we come to the third term. We have


(27) follows from the triangle inequality, (28) follows from the blockwise Lipschitz continuity in Assumption 1, and (34) follows from the fact


For the last parts of (35), we have


where (40) is from the fact (15). Combining (35) and (41), we get


Combining (20), (23), (26) and (42), and dividing by $T$, we have