Learning Privately over Distributed Features: An ADMM Sharing Approach

07/17/2019 ∙ by Yaochen Hu, et al. ∙ University of Alberta ∙ University of Kent

Distributed machine learning has been widely studied in order to handle the exploding amount of data. In this paper, we study an important yet less visited distributed learning problem where features are inherently distributed, or vertically partitioned, among multiple parties, and sharing of raw data or model parameters among parties is prohibited due to privacy concerns. We propose an ADMM sharing framework to approach risk minimization over distributed features, where each party only needs to share a single value for each sample in the training process, thus minimizing the risk of data leakage. We establish convergence and iteration complexity results for the proposed parallel ADMM algorithm under non-convex loss. We further introduce a novel differentially private ADMM sharing algorithm and bound the privacy guarantee with carefully designed noise perturbation. Experiments based on a prototype system show that the proposed ADMM algorithms converge efficiently and robustly, demonstrating an advantage over gradient-based methods, especially for data sets with high-dimensional feature spaces.




1 Introduction

The effectiveness of a machine learning model depends not only on the quantity of samples, but also on the quality of data, especially the availability of high-quality features. Recently, a wide range of distributed and collaborative machine learning schemes, including gradient-based methods Li et al. (2014a); Li et al. (2014b); Hsieh et al. (2017); Ho et al. (2013) and ADMM-based methods Zhang et al. (2018); Shi et al. (2014); Zhang and Zhu (2016a); Huang et al. (2018), have been proposed to enable learning from distributed samples, since collecting data for centralized learning incurs compliance overhead, privacy concerns, or even judicial issues. Most existing schemes, however, fall under the umbrella of data-parallel schemes, where multiple parties possess different training samples, each sample carrying the same set of features. For example, different users hold different images to jointly train a classifier.

An equally important scenario is to collaboratively learn from distributed features, where multiple parties may possess different features about the same sample, yet do not wish to share these features with each other. Examples include a user's behavioural data logged by multiple apps, a patient's records stored at different hospitals and clinics, a user's investment behaviour logged by multiple financial institutions and government agencies, and so forth. The question is: how can we train a joint model to make predictions about a sample leveraging the potentially rich and vast features possessed by other parties, without requiring different parties to share their data with each other?

The motivation of gleaning insights from vertically partitioned data dates back to association rule mining Vaidya and Clifton (2002, 2003). A few very recent studies Kenthapadi et al. (2013); Ying et al. (2018); Hu et al. (2019); Heinze-Deml et al. (2018); Dai et al. (2018); Stolpe et al. (2016) have reinvestigated vertically partitioned features under the setting of distributed machine learning, which is motivated by the ever-increasing data dimensionality as well as the opportunity and challenge of cooperation between multiple parties that may hold different aspects of information about the same samples.

In this paper, we propose an ADMM algorithm to solve the empirical risk minimization (ERM) problem, a general optimization formulation behind many machine learning models visited by a number of recent studies on distributed machine learning Ying et al. (2018); Chaudhuri et al. (2011). We propose an ADMM-sharing-based distributed algorithm to solve ERM, in which no participant needs to share any raw features or local model parameters with other parties. Instead, each party only transmits a single value per sample to other parties, thus largely preventing the local features from being disclosed. We establish theoretical convergence guarantees and iteration complexity results under non-convex loss in a fully parallel setting, whereas previously, the convergence of the ADMM sharing algorithm for non-convex losses was only known for the case of sequential (Gauss-Seidel) execution Hong et al. (2016).

To further provide privacy guarantees, we present a privacy-preserving version of the ADMM sharing algorithm, in which the transmitted value from each party is perturbed by a carefully designed Gaussian noise to achieve the notion of $(\epsilon, \delta)$-differential privacy Dwork (2008); Dwork et al. (2014). For distributed features, the perturbed algorithm ensures that the probability distribution of the values shared is relatively insensitive to any change to a single feature in a party's local dataset.

Experimental results on two real-world datasets suggest that our proposed ADMM sharing algorithm converges efficiently. Compared to a gradient-based method, our method scales well as the number of features increases and yields robust convergence. The algorithm can also converge with moderate amounts of Gaussian perturbation added, therefore enabling the utilization of features from other parties to improve the local machine learning task.

1.1 Related Work

Machine Learning Algorithms and Privacy. Chaudhuri and Monteleoni (2009) is one of the first studies combining machine learning and differential privacy (DP), focusing on logistic regression. Shokri and Shmatikov (2015) applies a variant of SGD to collaborative deep learning in a data-parallel fashion and introduces its DP variant. Abadi et al. (2016) provides a stronger differential privacy guarantee for training deep neural networks using a moments accountant method. Pathak et al. (2010); Rajkumar and Agarwal (2012) apply DP to collaborative machine learning, with an inherent tradeoff between the privacy cost and the utility achieved by the trained model. Recently, DP has been applied to ADMM algorithms to solve multi-party machine learning problems Zhang et al. (2018); Zhang and Zhu (2016a); Zhang et al. (2019); Zhang and Zhu (2017).

However, all of the above work targets the data-parallel scenario, where samples are distributed among nodes. The uniqueness of our work is to enable privacy-preserving machine learning among nodes with vertically partitioned features, or in other words, the feature-parallel setting, which is equally important yet largely unexplored.

Another approach to privacy-preserving machine learning is through encryption Gilad-Bachrach et al. (2016); Takabi et al. (2016); Kikuchi et al. (2018) or secret sharing Mohassel and Zhang (2017); Wan et al. (2007); Bonte and Vercauteren (2018), so that models are trained on encrypted data. However, encryption cannot be generalized to all algorithms or operations, and incurs additional computational cost.

Learning over Distributed Features. Gratton et al. (2018) applies ADMM to solve ridge regression. Ying et al. (2018) proposes a stochastic learning method via variance reduction. Zhou et al. (2016) proposes a proximal gradient method and mainly focuses on speeding up training in a model-parallel scenario. These studies do not consider the privacy issue. Hu et al. (2019) proposes a composite model structure that can jointly learn from distributed features via an SGD-based algorithm and its DP-enabled version, yet without offering theoretical privacy guarantees. Our work establishes the first $(\epsilon, \delta)$-differential privacy guarantee for learning over distributed features. Experimental results further suggest that our ADMM sharing method converges in fewer epochs than gradient methods in the case of high-dimensional features. This is critical to preserving privacy in machine learning, since the privacy loss increases as the number of epochs increases Dwork et al. (2014).

Querying Vertically Partitioned Data Privately. Vaidya and Clifton (2002); Evfimievski et al. (2004); Dwork and Nissim (2004) are among the early studies that investigate the privacy issue of querying vertically partitioned data. Kenthapadi et al. (2012) adopts a random-kernel-based method to mine vertically partitioned data privately. These studies provide privacy guarantees for simpler static queries, while we focus on machine learning jobs, where the risk comes from the shared values in the optimization algorithm. Our design simultaneously achieves minimum message passing, fast convergence, and a theoretically bounded privacy cost under the DP framework.

2 Empirical Risk Minimization over Distributed Features

Consider $N$ samples, each with $d$ features distributed on $M$ parties, which do not wish to share data with each other. The entire dataset $\mathcal{D} \in \mathbb{R}^{N \times d}$ can be viewed as $M$ vertical partitions $\mathcal{D}_1, \ldots, \mathcal{D}_M$, where $\mathcal{D}_m \in \mathbb{R}^{N \times d_m}$ denotes the data possessed by the $m$th party and $d_m$ is the dimension of features on party $m$. Clearly, $\sum_{m=1}^M d_m = d$. Let $\mathcal{D}_i$ denote the $i$th row of $\mathcal{D}$, and $\mathcal{D}_i^m$ be the $i$th row of $\mathcal{D}_m$ ($m = 1, \ldots, M$). Then, we have

$$\mathcal{D}_i = \big(\mathcal{D}_i^1, \ldots, \mathcal{D}_i^M\big),$$

where $\mathcal{D}_i^m \in \mathbb{R}^{1 \times d_m}$ ($m = 1, \ldots, M$). Let $y_i$ be the label of sample $i$.

Let $x = (x_1^\top, \ldots, x_M^\top)^\top \in \mathbb{R}^d$ represent the model parameters, where $x_m \in \mathbb{R}^{d_m}$ are the local parameters associated with the $m$th party. The objective is to find a model with parameters $x$ to minimize the regularized empirical risk, i.e.,

$$\min_{x \in X} \; \ell(x; \mathcal{D}) + R(x),$$

where $X \subseteq \mathbb{R}^d$ is a closed convex set and the regularizer $R(x)$ prevents overfitting.

Similar to recent literature on distributed machine learning Ying et al. (2018); Zhou et al. (2016), ADMM Zhang and Zhu (2016a); Zhang et al. (2018), and privacy-preserving machine learning Chaudhuri et al. (2011); Hamm et al. (2016), we assume the loss has the form

$$\ell(x; \mathcal{D}) = \frac{1}{N}\sum_{i=1}^N \ell_i\big(\mathcal{D}_i x;\, y_i\big) = \frac{1}{N}\sum_{i=1}^N \ell_i\Big(\sum_{m=1}^M \mathcal{D}_i^m x_m\Big),$$

where we have abused the notation of $\ell_i$ and, in the second equality, absorbed the label $y_i$ into the loss $\ell_i$, which is possibly a non-convex function. This framework incorporates a wide range of commonly used models including support vector machines, Lasso, logistic regression, boosting, etc.
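For instance, with the logistic loss, absorbing the label $y_i$ into $\ell_i$ as above gives the familiar form (a standard illustration using this section's notation):

```latex
\ell_i\Big(\sum_{m=1}^{M} \mathcal{D}_i^m x_m\Big)
  = \log\!\Big(1 + \exp\Big(-\,y_i \sum_{m=1}^{M} \mathcal{D}_i^m x_m\Big)\Big).
```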

Therefore, the risk minimization over distributed features, or vertically partitioned datasets $\mathcal{D}_1, \ldots, \mathcal{D}_M$, can be written in the following compact form:

$$\min_{x}\; \frac{1}{N}\sum_{i=1}^N \ell_i\Big(\sum_{m=1}^M \mathcal{D}_i^m x_m\Big) + \sum_{m=1}^M R_m(x_m) \tag{1}$$

$$\text{subject to}\quad x_m \in X_m, \quad m = 1, \ldots, M, \tag{2}$$

where $X_m \subseteq \mathbb{R}^{d_m}$ is a closed convex set for all $m$.

We have further assumed the regularizer is separable such that $R(x) = \sum_{m=1}^M R_m(x_m)$. This assumption is consistent with our algorithm design philosophy: under vertically partitioned data, we require each party to focus on training and regularizing its local model $x_m$, without sharing any local model parameters or raw features with other parties at all.
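As a minimal sketch of this setup, the following hypothetical Python snippet splits a design matrix column-wise among three parties; the names and sizes are purely illustrative:

```python
import numpy as np

# Vertical (feature-wise) partitioning: the full design matrix D
# (N samples x d features) is split column-wise among M parties, so
# party m holds D_m with d_m columns and sum_m d_m = d.
rng = np.random.default_rng(42)
N, d = 8, 10
D = rng.standard_normal((N, d))

dims = [4, 3, 3]                   # d_m for each of the M = 3 parties
cuts = np.cumsum(dims)[:-1]        # column indices where the splits occur
parts = np.split(D, cuts, axis=1)  # [D_1, D_2, D_3], one block per party
```

Each party keeps all $N$ rows (samples) but only its own block of columns (features), which is exactly the partition the rest of the paper assumes.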

3 The ADMM Sharing Algorithm

We present an ADMM sharing algorithm Boyd et al. (2011); Hong et al. (2016) to solve Problem (1) and establish a convergence guarantee for the algorithm. Our algorithm requires each party to share only a single value per sample with other parties in each iteration, thus requiring the minimum message passing. In particular, Problem (1) is equivalent to

$$\min_{x, z}\; \frac{1}{N}\sum_{i=1}^N \ell_i(z_i) + \sum_{m=1}^M R_m(x_m) \tag{3}$$

$$\text{s.t.}\quad \sum_{m=1}^M \mathcal{D}_m x_m - z = 0, \quad x_m \in X_m, \quad m = 1, \ldots, M, \tag{4}$$

where $z \in \mathbb{R}^N$ is an auxiliary variable. The corresponding augmented Lagrangian is given by

$$L\big(\{x_m\}, z; y\big) = \frac{1}{N}\sum_{i=1}^N \ell_i(z_i) + \sum_{m=1}^M R_m(x_m) + \Big\langle y,\, \sum_{m=1}^M \mathcal{D}_m x_m - z \Big\rangle + \frac{\rho}{2}\,\Big\| \sum_{m=1}^M \mathcal{D}_m x_m - z \Big\|^2, \tag{5}$$

where $y \in \mathbb{R}^N$ is the dual variable and $\rho$ is the penalty factor. In the $k$th iteration of the algorithm, the variables are updated according to

$$x_m^{k+1} := \arg\min_{x_m \in X_m} L\big(x_1^k, \ldots, x_m, \ldots, x_M^k, z^k; y^k\big), \quad m = 1, \ldots, M, \tag{6}$$

$$z^{k+1} := \arg\min_{z} L\big(\{x_m^{k+1}\}, z; y^k\big), \tag{7}$$

$$y^{k+1} := y^k + \rho\,\Big(\sum_{m=1}^M \mathcal{D}_m x_m^{k+1} - z^{k+1}\Big). \tag{8}$$

Formally, in a distributed and fully parallel manner, the algorithm is described in Algorithm 1. Note that each party $m$ needs the value $\sum_{m' \neq m} \mathcal{D}_{m'} x_{m'}^k$ to complete the update in (6), and Lines 3, 4 and 12 in Algorithm 1 present a trick to reduce communication overhead.

1:  —–Each party $m$ performs in parallel:
2:  for $k = 0, 1, \ldots$ do
3:     Pull $\sum_{m'} \mathcal{D}_{m'} x_{m'}^k$, $z^k$ and $y^k$ from the central node
4:     Obtain $\sum_{m' \neq m} \mathcal{D}_{m'} x_{m'}^k$ by subtracting the locally cached $\mathcal{D}_m x_m^k$ from the pulled value
5:     Compute $x_m^{k+1}$ according to (6)
6:     Push $\mathcal{D}_m x_m^{k+1}$ to the central node
7:  —–Central node:
8:  for $k = 0, 1, \ldots$ do
9:     Collect $\mathcal{D}_m x_m^{k+1}$ for all $m = 1, \ldots, M$
10:     Compute $z^{k+1}$ according to (7)
11:     Compute $y^{k+1}$ according to (8)
12:     Distribute $\sum_m \mathcal{D}_m x_m^{k+1}$, $z^{k+1}$ and $y^{k+1}$ to all the parties.
Algorithm 1 The ADMM Sharing Algorithm
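To make the updates concrete, here is a hypothetical single-process simulation of the sharing iteration on a toy least-squares instance, where the loss is $\tfrac12\|z-b\|^2$ and each $R_m$ is a ridge penalty so that every subproblem has a closed form. This is an illustrative sketch under those assumptions, not the prototype system of Section 5:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dims = 50, [3, 4, 2]            # N samples; feature dims of the M = 3 parties
D = [rng.standard_normal((N, d)) for d in dims]
b = rng.standard_normal(N)         # regression targets absorbed into the loss
lam, rho, T = 0.1, 1.0, 300        # ridge weight, penalty factor, iterations

x = [np.zeros(d) for d in dims]    # local models, one per party
z = np.zeros(N)                    # auxiliary variable, z ~ sum_m D_m x_m
y = np.zeros(N)                    # dual variable

for _ in range(T):
    w = sum(Dm @ xm for Dm, xm in zip(D, x))   # current aggregate share
    # x_m update: each party solves its subproblem in parallel, using only
    # the other parties' aggregate (w - D_m x_m); closed form for ridge R_m.
    new_x = []
    for Dm, xm in zip(D, x):
        c = w - Dm @ xm
        A = lam * np.eye(Dm.shape[1]) + rho * Dm.T @ Dm
        new_x.append(np.linalg.solve(A, Dm.T @ (rho * (z - c) - y)))
    x = new_x
    w = sum(Dm @ xm for Dm, xm in zip(D, x))
    # z update: argmin 0.5||z-b||^2 - <y, z> + (rho/2)||w - z||^2 (closed form)
    z = (b + y + rho * w) / (1.0 + rho)
    # dual update on the consensus residual
    y = y + rho * (w - z)

residual = np.linalg.norm(w - z)
loss = 0.5 * np.linalg.norm(w - b) ** 2 + 0.5 * lam * sum(xm @ xm for xm in x)
```

Note that, mirroring Algorithm 1, each party only ever contributes its product $\mathcal{D}_m x_m$ (one value per sample) to the aggregate; neither $x_m$ nor $\mathcal{D}_m$ leaves the party.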

3.1 Convergence Analysis

We follow Hong et al. (2016) to establish the convergence guarantee of the proposed algorithm under mild assumptions. Note that Hong et al. (2016) provides a convergence analysis for the Gauss-Seidel version of ADMM sharing, where $x_1, \ldots, x_M$ are updated sequentially, which is not naturally suited to parallel implementation. In (6) of our algorithm, the $x_m$'s can be updated by different parties in parallel in each iteration. We establish convergence as well as iteration complexity results for this parallel scenario, which is more realistic in distributed learning. We need the following set of common assumptions.

Assumption 1
  1. There exists a positive constant $L > 0$ such that

     $$\|\nabla \ell(z) - \nabla \ell(z')\| \le L\,\|z - z'\| \quad \forall z, z'.$$

     Moreover, for all $m$, the $X_m$'s are closed convex sets; each $\mathcal{D}_m$

     is of full column rank so that the minimum eigenvalue $\sigma_{\min}(\mathcal{D}_m^\top \mathcal{D}_m)$

     of matrix $\mathcal{D}_m^\top \mathcal{D}_m$ is positive.

  2. The penalty parameter $\rho$ is chosen large enough such that

     1. each subproblem (6) as well as the subproblem (7) is strongly convex, with modulus $\gamma_m(\rho)$ and $\gamma(\rho)$, respectively.

     2. $\rho\,\gamma(\rho) > 2\sigma_{\max}^2$, where $\sigma_{\max}$ is the maximum eigenvalue for matrix $\mathcal{D}^\top \mathcal{D}$.

     3. $\rho \ge L$ and $\gamma(\rho) > L$.

  3. The objective function in Problem (1) is lower bounded over $X_1 \times \cdots \times X_M$ and we denote the lower bound as $\underline{f}$.

  4. $R_m$ is either smooth nonconvex or convex (possibly nonsmooth). For the former case, there exists $L_m > 0$ such that $\|\nabla R_m(x_m) - \nabla R_m(x_m')\| \le L_m\,\|x_m - x_m'\|$ for all $x_m, x_m' \in X_m$.

Specifically, items 1, 3 and 4 in Assumption 1 are common settings in the literature. Assumption 1.2 is achievable if $\rho$ is chosen large enough.

Denote $\mathcal{C} \subseteq \{1, \ldots, M\}$ as the index set such that when $m \in \mathcal{C}$, $R_m$ is convex; otherwise, $R_m$ is nonconvex but smooth. Our convergence results show that under mild assumptions, the iteratively updated variables eventually converge to the set of primal-dual stationary solutions.

Theorem 1

Suppose Assumption 1 holds true; then we have the following results:

  1. $\lim_{k \to \infty} \big\| \sum_{m=1}^M \mathcal{D}_m x_m^{k+1} - z^{k+1} \big\| = 0$.

  2. Any limit point $\big(\{x_m^*\}, z^*; y^*\big)$ of the sequence $\big(\{x_m^{k+1}\}, z^{k+1}; y^{k+1}\big)$ is a stationary solution of problem (1) in the sense that

     $$x_m^* \in \arg\min_{x_m \in X_m} R_m(x_m) + \big\langle y^*, \mathcal{D}_m x_m \big\rangle, \quad m = 1, \ldots, M,$$
     $$\nabla \ell(z^*) - y^* = 0, \qquad \sum_{m=1}^M \mathcal{D}_m x_m^* = z^*.$$

  3. If $X_m$ is a compact set for all $m$, then $\big(\{x_m^k\}, z^k; y^k\big)$ converges to the set of stationary solutions of problem (1), i.e.,

     $$\lim_{k \to \infty} \operatorname{dist}\Big(\big(\{x_m^k\}, z^k; y^k\big);\, Z^*\Big) = 0,$$

     where $Z^*$ is the set of primal-dual stationary solutions for problem (1).

3.2 Iteration Complexity Analysis

We evaluate the iteration complexity over a Lyapunov function. More specifically, we define $V^k$ as the sum of the squared (proximal-)gradient residuals of the subproblems in (6) and (7) and the squared primal feasibility gap $\big\| \sum_{m=1}^M \mathcal{D}_m x_m^k - z^k \big\|^2$. It is easy to verify that when $V^k \to 0$, a stationary solution is achieved due to the properties of these residuals. The result for the iteration complexity is stated in the following theorem, which quantifies how fast our algorithm converges. Theorem 2 shows that the algorithm converges in the sense that the Lyapunov function will be less than any $\epsilon > 0$ within $O(1/\epsilon)$ iterations.

Theorem 2

Suppose Assumption 1 holds. Let $T(\epsilon)$ denote the first iteration index in which:

$$V^{T(\epsilon)} \le \epsilon,$$

for any $\epsilon > 0$. Then there exists a constant $C > 0$ such that

$$T(\epsilon) \le \frac{C\,\big(L(\{x_m^1\}, z^1; y^1) - \underline{f}\big)}{\epsilon},$$

where $\underline{f}$ is the lower bound defined in Assumption 1.

4 Differentially Private ADMM Sharing

Differential privacy Dwork et al. (2014); Zhou et al. (2010) is a notion that ensures a strong guarantee for data privacy. The intuition is to keep the query results from a dataset relatively close if one of the entries in the dataset changes, by adding some well-designed random noise into the query, so that little information on the raw data can be inferred from the query. Formally, the definition of differential privacy is given in Definition 1.

Definition 1

A randomized algorithm $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if for all neighbouring datasets $\mathcal{D}$ and $\mathcal{D}'$ differing in a single entry, and for all measurable sets $\mathcal{S}$ in the output space of $\mathcal{M}$, the following holds:

$$\Pr\big[\mathcal{M}(\mathcal{D}) \in \mathcal{S}\big] \le e^{\epsilon}\,\Pr\big[\mathcal{M}(\mathcal{D}') \in \mathcal{S}\big] + \delta.$$
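To illustrate the definition, the classical Laplace mechanism (a textbook example from the DP literature, not the Gaussian mechanism used later in this section) releases a scalar query privately by adding noise scaled to the query's sensitivity:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with epsilon-differential privacy by adding
    Laplace(sensitivity / epsilon) noise (standard Laplace mechanism)."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
# A counting query changes by at most 1 between neighbouring datasets,
# so its L1 sensitivity is 1.
noisy_count = laplace_mechanism(100.0, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller $\epsilon$ means a larger noise scale and hence stronger privacy, at the cost of less accurate released values.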


In our ADMM algorithm, the shared messages $\mathcal{D}_m x_m^{k+1}$ may reveal sensitive information from the data entries in $\mathcal{D}_m$ of Party $m$. We perturb the shared value in Algorithm 1 with a carefully designed random noise to provide differential privacy. The resulting perturbed ADMM sharing algorithm performs the following updates:

$$x_m^{k+1} := \arg\min_{x_m \in X_m} L\big(x_1^k, \ldots, x_m, \ldots, x_M^k, z^k; y^k\big), \qquad \widetilde{\mathcal{D}_m x_m^{k+1}} := \mathcal{D}_m x_m^{k+1} + \xi_m^{k+1},$$
$$z^{k+1} := \arg\min_{z} L\big(\{\widetilde{\mathcal{D}_m x_m^{k+1}}\}, z; y^k\big), \qquad y^{k+1} := y^k + \rho\,\Big(\sum_{m=1}^M \widetilde{\mathcal{D}_m x_m^{k+1}} - z^{k+1}\Big), \tag{16}$$

where $\xi_m^{k+1}$ is a Gaussian noise vector. In the remaining part of this section, we demonstrate that (16) guarantees $(\epsilon, \delta)$-differential privacy with outputs $\{\widetilde{\mathcal{D}_m x_m^{k+1}}\}$ for some carefully selected noise variance $\sigma_{m,k+1}^2$. Besides Assumption 1, we introduce another set of assumptions widely used in the literature.

Assumption 2
  1. The feasible sets $X_m$ and the dual variable $y$ are bounded; their $\ell_2$ norms have an upper bound $b_1$.

  2. The regularizer $R_m$ is doubly differentiable with $\|\nabla^2 R_m\|_2 \le c_1$, where $c_1$ is a finite constant.

  3. Each row of $\mathcal{D}_m$ is normalized and has an $\ell_2$ norm of 1.

Note that Assumption 2.1 is adopted in Sarwate and Chaudhuri (2013) and Wang et al. (2019). Assumption 2.2 comes from Zhang and Zhu (2016b), and Assumption 2.3 comes from Zhang and Zhu (2016b) and Sarwate and Chaudhuri (2013). As is typical in differential privacy analysis, we first study the sensitivity of the shared message $\mathcal{D}_m x_m^{k+1}$, which is defined as follows.

Definition 2

The $\ell_2$-norm sensitivity of $\mathcal{D}_m x_m^{k+1}$ is defined by:

$$\Delta_{m,k+1} = \max_{\mathcal{D}_m,\, \mathcal{D}_m'} \big\| \mathcal{D}_m x_m^{k+1} - \mathcal{D}_m' \hat{x}_m^{k+1} \big\|_2,$$

where $\mathcal{D}_m$ and $\mathcal{D}_m'$ are two neighbouring datasets differing in only one feature column, and $\hat{x}_m^{k+1}$ is the update derived from the first line of equation (16) under dataset $\mathcal{D}_m'$.

We have Lemma 1 to state the upper bound of the $\ell_2$-norm sensitivity of $\mathcal{D}_m x_m^{k+1}$.

Lemma 1

Assume that Assumption 1 and Assumption 2 hold. Then the $\ell_2$-norm sensitivity of $\mathcal{D}_m x_m^{k+1}$ is upper bounded by a constant $\Delta$ that depends only on the bounds in Assumptions 1 and 2.

Theorem 3

Assume Assumptions 2.1–2.3 hold and $\Delta$ is the upper bound of the $\ell_2$-norm sensitivity given in Lemma 1. Let $\epsilon \in (0, 1]$ and $\delta \in (0, 1)$ be arbitrary constants and let $\xi_m^{k+1}$

be sampled from a zero-mean Gaussian distribution with variance $\sigma_{m,k+1}^2$

, where

$$\sigma_{m,k+1} = \frac{\Delta\,\sqrt{2\ln(1.25/\delta)}}{\epsilon}.$$

Then each iteration guarantees $(\epsilon, \delta)$-differential privacy. Specifically, for any neighbouring datasets $\mathcal{D}_m$ and $\mathcal{D}_m'$, for any output set $\mathcal{S}$, the following inequality always holds:

$$\Pr\big[\mathcal{D}_m x_m^{k+1} + \xi_m^{k+1} \in \mathcal{S}\big] \le e^{\epsilon}\,\Pr\big[\mathcal{D}_m' \hat{x}_m^{k+1} + \xi_m^{k+1} \in \mathcal{S}\big] + \delta.$$
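The noise scale above matches the textbook Gaussian-mechanism calibration of Dwork et al. (2014). A small sketch of that standard calibration (the parameter values are illustrative, and the formula is the textbook bound rather than anything specific to our proofs) is:

```python
import math

def gaussian_sigma(delta_f, epsilon, delta):
    """Standard Gaussian mechanism: to make one release of a quantity with
    L2 sensitivity `delta_f` (epsilon, delta)-differentially private, it
    suffices to add zero-mean Gaussian noise with this standard deviation."""
    return delta_f * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

# e.g., sensitivity 1, epsilon = 1, delta = 1e-5
sigma = gaussian_sigma(delta_f=1.0, epsilon=1.0, delta=1e-5)
```

As expected, the required noise grows linearly with the sensitivity and shrinks as the privacy budget $\epsilon$ is relaxed.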

With an application of the composition theorem in Dwork et al. (2014), we arrive at a result stating the overall privacy guarantee for the whole training procedure.

Corollary 1

For any $\delta' > 0$, the algorithm described in (16) satisfies $(\epsilon', T\delta + \delta')$-differential privacy within $T$ epochs of updates, where

$$\epsilon' = \sqrt{2T\ln(1/\delta')}\,\epsilon + T\epsilon\,(e^{\epsilon} - 1).$$
5 Experiments

(a) Loss vs. epoch
(b) Test log loss under different noise levels
Figure 1: Performance over the a9a data set with 32561 training samples, 16281 testing samples and 123 features.
(a) Loss vs. epoch
(b) Test log loss under different noise levels
Figure 2: Performance over the gisette data set with 6000 training samples, 1000 testing samples and 5000 features.
(a) a9a data set
(b) gisette data set
Figure 3: Test performance for ADMM under different levels of added noise.

We test our algorithm by training $\ell_2$-norm regularized logistic regression on two popular public datasets, namely, a9a from UCI Dua and Graff (2017) and gisette Guyon et al. (2005). We obtained the datasets from LIBSVM Lib ([n.d.]) and follow the same preprocessing procedure listed there. The a9a dataset contains 32561 training samples, 16281 testing samples and 123 features. We divide the dataset into two parts, with the first part containing the first 66 features and the second part the remaining 57 features. The first part is regarded as the local party that wishes to improve its prediction model with the help of data from the other party. The gisette dataset contains 6000 training samples, 1000 testing samples and 5000 features. Similarly, we divide the features into three parts: the first 2000 features form the first part, regarded as the local data; the next 2000 features form the second part; and the remaining 1000 features form the third part. Note that a9a is small in terms of the number of features while gisette has a relatively high-dimensional feature space.

A prototype system is implemented in Python to verify our proposed algorithm. Specifically, we use the optimization module from scipy to handle the optimization subproblems. We apply the L-BFGS-B algorithm for the $x_m$ update in (6) and entry-wise scalar optimization for the $z$ update in (7). We run the experiments on a machine equipped with an Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz and 128 GB of memory.
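As an illustration of how such a local subproblem can be handed to L-BFGS-B, the following hypothetical sketch minimizes a logistic loss plus a quadratic (augmented-Lagrangian-style) penalty in the local weights via scipy.optimize; the shapes, names, and penalty target are made up for the example and are not our prototype's actual code:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, d = 200, 20
Dm = rng.standard_normal((N, d))          # the party's local feature block
labels = rng.choice([-1.0, 1.0], size=N)  # binary labels in {-1, +1}
target = rng.standard_normal(N)           # stands in for z^k minus the other parties' shares
rho = 1.0

def objective(x):
    # logistic loss on the local block (numerically stable via logaddexp)
    margin = labels * (Dm @ x)
    logistic = np.mean(np.logaddexp(0.0, -margin))
    # quadratic penalty pulling D_m x toward the consensus target
    penalty = 0.5 * rho * np.sum((Dm @ x - target) ** 2)
    return logistic + penalty

res = minimize(objective, np.zeros(d), method="L-BFGS-B")
```

The gradient is left to scipy's finite differencing here for brevity; supplying an analytic `jac` would speed up a real implementation.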

We compare our algorithm against the SGD-based algorithm proposed in Hu et al. (2019). We keep track of the training objective value (log loss plus the regularizer) and the testing log loss in each epoch for different datasets and parameter settings. We also test our algorithm with different levels of added Gaussian noise. In the training procedure, we initialize the elements of $x$, $z$ and $y$ with zeros, while we initialize the parameters of the SGD-based algorithm with random numbers.

Fig. 1 and Fig. 2 show a typical trace of the training objective and the testing log loss against epochs for a9a and gisette, respectively. On a9a, the ADMM algorithm is slightly slower than the SGD-based algorithm, while they reach the same testing log loss in the end. On gisette, the SGD-based algorithm converges slowly while the ADMM algorithm is efficient and robust. The testing log loss of the ADMM algorithm quickly converges to 0.08 after a few epochs, whereas the SGD-based algorithm converges only to 0.1 after many more epochs. This shows that the ADMM algorithm is superior when the number of features is large. In fact, in each epoch, the $x_m$ update is a trivial quadratic program and can be solved efficiently numerically. The $z$ update contains optimization over computationally expensive functions, but for each sample it is always an optimization over a single scalar, so it can be solved efficiently via scalar optimization regardless of the number of features.

Fig. 3 shows the testing loss for ADMM under different levels of added Gaussian noise. The two baselines are the logistic regression model trained over all the features (in a centralized way) and the model trained over only the local features of the first party. The baselines are trained with the built-in logistic regression function from the sklearn library. We can see a significant performance boost when we employ more features to help train the model on Party 1. Interestingly, in Fig. 3(b), ADMM sharing even outperforms the baseline trained with all features with sklearn. This further shows that ADMM sharing is better suited to datasets with a large number of features.

Moreover, after applying moderate random perturbations, the proposed algorithm can still converge within a relatively small number of epochs, as Fig. 1(b) and Fig. 2(b) suggest, although too much noise can ruin the model. Therefore, the ADMM sharing algorithm under moderate perturbation can improve the local model, and the privacy cost is well contained since the algorithm converges in a few epochs.

6 Conclusion

We study learning over distributed features (vertically partitioned data) where none of the parties shall share their local data. We propose a parallel ADMM sharing algorithm to solve this challenging problem, in which only intermediate values are shared, without even sharing model parameters. We have shown convergence for both convex and non-convex loss functions. To further protect data privacy, we apply differential privacy techniques in the training procedure to derive a privacy guarantee over $T$ epochs of training. We implement a prototype system and evaluate the proposed algorithm on two representative datasets in risk minimization. The results show that the ADMM sharing algorithm converges efficiently, especially on datasets with a large number of features. Furthermore, the differentially private ADMM algorithm yields better prediction accuracy than a model trained only on local features, while ensuring a certain level of differential privacy guarantee.


  • Lib ([n.d.]) [n.d.]. LIBSVM Data: Classification (Binary Class). https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. Accessed: 2019-05-23.
  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318.
  • Bonte and Vercauteren (2018) Charlotte Bonte and Frederik Vercauteren. 2018. Privacy-Preserving Logistic Regression Training. Technical Report. IACR Cryptology ePrint Archive 233.
  • Boyd et al. (2011) Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3, 1 (2011), 1–122.
  • Chaudhuri and Monteleoni (2009) Kamalika Chaudhuri and Claire Monteleoni. 2009. Privacy-preserving logistic regression. In Advances in neural information processing systems. 289–296.
  • Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. 2011. Differentially private empirical risk minimization. Journal of Machine Learning Research 12, Mar (2011), 1069–1109.
  • Dai et al. (2018) Wenrui Dai, Shuang Wang, Hongkai Xiong, and Xiaoqian Jiang. 2018. Privacy preserving federated big data analysis. In Guide to Big Data Applications. Springer, 49–82.
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
  • Dwork (2008) Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 1–19.
  • Dwork and Nissim (2004) Cynthia Dwork and Kobbi Nissim. 2004. Privacy-preserving datamining on vertically partitioned databases. In Annual International Cryptology Conference. Springer, 528–544.
  • Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
  • Evfimievski et al. (2004) Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Johannes Gehrke. 2004. Privacy preserving mining of association rules. Information Systems 29, 4 (2004), 343–364.
  • Gilad-Bachrach et al. (2016) Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. 2016. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning. 201–210.
  • Gratton et al. (2018) Cristiano Gratton, Venkategowda Naveen KD, Reza Arablouei, and Stefan Werner. 2018. Distributed Ridge Regression with Feature Partitioning. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers. IEEE, 1423–1427.
  • Guyon et al. (2005) Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. 2005. Result analysis of the NIPS 2003 feature selection challenge. In Advances in neural information processing systems. 545–552.
  • Hamm et al. (2016) Jihun Hamm, Yingjun Cao, and Mikhail Belkin. 2016. Learning privately from multiparty data. In International Conference on Machine Learning. 555–563.
  • Heinze-Deml et al. (2018) Christina Heinze-Deml, Brian McWilliams, and Nicolai Meinshausen. 2018. Preserving Differential Privacy Between Features in Distributed Estimation. stat 7 (2018), e189.
  • Ho et al. (2013) Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson, Greg Ganger, and Eric P Xing. 2013. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems. 1223–1231.
  • Hong et al. (2016) Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. 2016. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization 26, 1 (2016), 337–364.
  • Hsieh et al. (2017) Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R Ganger, Phillip B Gibbons, and Onur Mutlu. 2017. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds.. In NSDI. 629–647.
  • Hu et al. (2019) Yaochen Hu, Di Niu, Jianming Yang, and Shengping Zhou. 2019. FDML: A Collaborative Machine Learning Framework for Distributed Features. In Proceedings of KDD ’19. ACM.
  • Huang et al. (2018) Zonghao Huang, Rui Hu, Yanmin Gong, and Eric Chan-Tin. 2018. DP-ADMM: ADMM-based Distributed Learning with Differential Privacy. arXiv preprint arXiv:1808.10101 (2018).
  • Kenthapadi et al. (2012) Krishnaram Kenthapadi, Aleksandra Korolova, Ilya Mironov, and Nina Mishra. 2012. Privacy via the johnson-lindenstrauss transform. arXiv preprint arXiv:1204.2606 (2012).
  • Kenthapadi et al. (2013) Krishnaram Kenthapadi, Aleksandra Korolova, Ilya Mironov, and Nina Mishra. 2013. Privacy via the Johnson-Lindenstrauss Transform. Journal of Privacy and Confidentiality 5 (2013).
  • Kikuchi et al. (2018) Hiroaki Kikuchi, Chika Hamanaga, Hideo Yasunaga, Hiroki Matsui, Hideki Hashimoto, and Chun-I Fan. 2018. Privacy-preserving multiple linear regression of vertically partitioned real medical datasets. Journal of Information Processing 26 (2018), 638–647.
  • Li et al. (2014a) Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014a. Scaling Distributed Machine Learning with the Parameter Server.. In OSDI, Vol. 14. 583–598.
  • Li et al. (2014b) Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. 2014b. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems. 19–27.
  • Mohassel and Zhang (2017) Payman Mohassel and Yupeng Zhang. 2017. SecureML: A system for scalable privacy-preserving machine learning. In 2017 38th IEEE Symposium on Security and Privacy (SP). IEEE, 19–38.
  • Pathak et al. (2010) Manas Pathak, Shantanu Rane, and Bhiksha Raj. 2010. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems. 1876–1884.
  • Rajkumar and Agarwal (2012) Arun Rajkumar and Shivani Agarwal. 2012. A differentially private stochastic gradient descent algorithm for multiparty classification. In Artificial Intelligence and Statistics. 933–941.
  • Sarwate and Chaudhuri (2013) Anand D Sarwate and Kamalika Chaudhuri. 2013. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE signal processing magazine 30, 5 (2013), 86–94.
  • Shi et al. (2014) Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. 2014. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing 62, 7 (2014), 1750–1761.
  • Shokri and Shmatikov (2015) Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. ACM, 1310–1321.
  • Stolpe et al. (2016) Marco Stolpe, Hendrik Blom, and Katharina Morik. 2016. Sustainable Industrial Processes by Embedded Real-Time Quality Prediction. In Computational Sustainability. Springer, 201–243.
  • Takabi et al. (2016) Hassan Takabi, Ehsan Hesamifard, and Mehdi Ghasemi. 2016. Privacy preserving multi-party machine learning with homomorphic encryption. In 29th Annual Conference on Neural Information Processing Systems (NIPS).
  • Vaidya and Clifton (2002) Jaideep Vaidya and Chris Clifton. 2002. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 639–644.
  • Vaidya and Clifton (2003) Jaideep Vaidya and Chris Clifton. 2003. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 206–215.
  • Wan et al. (2007) Li Wan, Wee Keong Ng, Shuguo Han, and Vincent Lee. 2007. Privacy-preservation for gradient descent methods. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 775–783.
  • Wang et al. (2019) Yu Wang, Wotao Yin, and Jinshan Zeng. 2019. Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing 78, 1 (2019), 29–63.
  • Ying et al. (2018) Bicheng Ying, Kun Yuan, and Ali H Sayed. 2018. Supervised Learning Under Distributed Features. IEEE Transactions on Signal Processing 67, 4 (2018), 977–992.
  • Zhang et al. (2019) Chunlei Zhang, Muaz Ahmad, and Yongqiang Wang. 2019. Admm based privacy-preserving decentralized optimization. IEEE Transactions on Information Forensics and Security 14, 3 (2019), 565–580.
  • Zhang and Zhu (2016a) Tao Zhang and Quanyan Zhu. 2016a. A dual perturbation approach for differential private admm-based distributed empirical risk minimization. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. ACM, 129–137.
  • Zhang and Zhu (2016b) Tao Zhang and Quanyan Zhu. 2016b. Dynamic differential privacy for ADMM-based distributed classification learning. IEEE Transactions on Information Forensics and Security 12, 1 (2016), 172–187.
  • Zhang and Zhu (2017) Tao Zhang and Quanyan Zhu. 2017. Dynamic differential privacy for ADMM-based distributed classification learning. IEEE Transactions on Information Forensics and Security 12, 1 (2017), 172–187.
  • Zhang et al. (2018) Xueru Zhang, Mohammad Mahdi Khalili, and Mingyan Liu. 2018. Improving the Privacy and Accuracy of ADMM-Based Distributed Algorithms. In International Conference on Machine Learning. 5791–5800.
  • Zhou et al. (2010) Minqi Zhou, Rong Zhang, Wei Xie, Weining Qian, and Aoying Zhou. 2010. Security and privacy in cloud computing: A survey. In 2010 Sixth International Conference on Semantics, Knowledge and Grids. IEEE, 105–112.
  • Zhou et al. (2016) Yi Zhou, Yaoliang Yu, Wei Dai, Yingbin Liang, and Eric Xing. 2016. On convergence of model parallel proximal gradient algorithm for stale synchronous parallel system. In Artificial Intelligence and Statistics. 713–722.

7 Supplementary Materials

7.1 Proof of Theorem 1

To help the theoretical analysis, we denote the objective functions in (6) and (7) as

$$g_m(x_m) \triangleq R_m(x_m) + \big\langle y^k, \mathcal{D}_m x_m \big\rangle + \frac{\rho}{2}\Big\| \mathcal{D}_m x_m + \sum_{m' \neq m} \mathcal{D}_{m'} x_{m'}^k - z^k \Big\|^2, \qquad h(z) \triangleq \ell(z) - \big\langle y^k, z \big\rangle + \frac{\rho}{2}\Big\| \sum_{m=1}^M \mathcal{D}_m x_m^{k+1} - z \Big\|^2,$$

correspondingly. We prove the following four lemmas to help prove the theorem.

Lemma 2

Under Assumption 1, we have

$$\|y^{k+1} - y^k\| \le L\,\|z^{k+1} - z^k\|.$$

Proof. By the optimality in (7), we have

$$\nabla \ell(z^{k+1}) - y^k - \rho\,\Big(\sum_{m=1}^M \mathcal{D}_m x_m^{k+1} - z^{k+1}\Big) = 0.$$

Combined with (8), we can get

$$\nabla \ell(z^{k+1}) = y^{k+1}.$$

Combined with Assumption 1.1, we have

$$\|y^{k+1} - y^k\| = \|\nabla \ell(z^{k+1}) - \nabla \ell(z^k)\| \le L\,\|z^{k+1} - z^k\|.$$
Lemma 3

We have

$$L\big(\{x_m^{k+1}\}, z^{k+1}; y^{k+1}\big) - L\big(\{x_m^{k+1}\}, z^{k+1}; y^k\big) = \frac{1}{\rho}\,\|y^{k+1} - y^k\|^2,$$

which follows directly from the dual update (8).

Lemma 4

Suppose Assumption 1 holds. We have

$$L\big(\{x_m^{k+1}\}, z^{k+1}; y^k\big) - L\big(\{x_m^k\}, z^k; y^k\big) \le -\sum_{m=1}^M \frac{\gamma_m(\rho)}{2}\,\|x_m^{k+1} - x_m^k\|^2 - \frac{\gamma(\rho)}{2}\,\|z^{k+1} - z^k\|^2.$$

Proof. The LHS can be decomposed into two parts as

$$L\big(\{x_m^{k+1}\}, z^{k+1}; y^k\big) - L\big(\{x_m^k\}, z^k; y^k\big) = \Big[L\big(\{x_m^{k+1}\}, z^{k+1}; y^k\big) - L\big(\{x_m^{k+1}\}, z^k; y^k\big)\Big] + \Big[L\big(\{x_m^{k+1}\}, z^k; y^k\big) - L\big(\{x_m^k\}, z^k; y^k\big)\Big]. \tag{22}$$

For the first term, we have

$$L\big(\{x_m^{k+1}\}, z^{k+1}; y^k\big) - L\big(\{x_m^{k+1}\}, z^k; y^k\big) \le -\frac{\gamma(\rho)}{2}\,\|z^{k+1} - z^k\|^2. \tag{23}$$

For the second term, we have

$$L\big(\{x_m^{k+1}\}, z^k; y^k\big) - L\big(\{x_m^k\}, z^k; y^k\big) \le -\sum_{m=1}^M \frac{\gamma_m(\rho)}{2}\,\|x_m^{k+1} - x_m^k\|^2 \tag{24}$$

(by the optimality conditions and strong convexity of the subproblems in (6) and (7)).

Note that we have abused the notation $\nabla R_m$ and denote it as a subgradient when $R_m$ is non-smooth but convex. Combining (22), (23) and (24), the lemma is proved.

Lemma 5

Suppose Assumption 1 holds. Then the following limit exists and is bounded from below:

$$\lim_{k \to \infty} L\big(\{x_m^{k+1}\}, z^{k+1}; y^{k+1}\big).$$