Differentially Private Decentralized Learning

Decentralized learning has received great attention for its high efficiency and performance. In such systems, every participant constantly exchanges parameters with its neighbors to train a shared model, which puts the participant at risk of data privacy leakage. Differential Privacy (DP) has been adopted to enhance the Stochastic Gradient Descent (SGD) algorithm, but existing approaches mainly focus on single-party learning or centralized learning in the synchronous mode. In this paper, we design a novel DP-SGD algorithm for decentralized learning systems. The key contribution of our solution is a topology-aware optimization strategy, which leverages the unique network characteristics of decentralized systems to effectively reduce the noise scale and improve model usability. In addition, we design a novel learning protocol for both synchronous and asynchronous decentralized systems by restricting the sensitivity of the SGD algorithm and maximizing the noise reduction. We formally analyze and prove that our proposed algorithms satisfy the DP requirement. Experimental evaluations demonstrate that our algorithm achieves a better trade-off between usability and privacy than prior works.


I Introduction

Deep Neural Networks (DNNs) have become one of the most popular and powerful machine learning methods for a wide range of artificial intelligence tasks. As modern DNN models grow more complicated and require more training effort, distributed learning has gained a lot of popularity, where multiple distributed participants work collaboratively on a training task [22, 15, 3]. Distributed learning can be divided into two categories: centralized and decentralized learning [17]. A centralized learning system utilizes a central server to collect and aggregate the estimates (i.e., model parameters) of participants at each iteration, while in a decentralized system, each participant exchanges estimates with its neighbors to reach consensus on the DNN model.

Distributed learning can prevent direct privacy leakage as each participant keeps its own private dataset locally during the training process. However, it still faces threats of indirect privacy leakage. Participants in distributed systems constantly exchange estimates, which may contain information about their private training data. Past works have demonstrated the feasibility and severity of model inversion attacks [10, 29, 9] and membership inference attacks [20] in the distributed learning setting.

To mitigate such privacy threats in distributed training, one promising solution is Differential Privacy (DP). DP was originally introduced to preserve the privacy of individual data records in statistical databases [5]. A number of studies have since applied DP to enhance the privacy of deep learning (DL) in different environments [1, 28, 16, 26, 12]. DL training usually adopts Stochastic Gradient Descent (SGD) to iteratively minimize the loss function and identify the optimal model parameters; DP-SGD algorithms inject additional noise at each iteration of this process. By carefully restricting the sensitivity of the SGD update and tracking the accumulated privacy loss, DP-SGD can guarantee the DP of deep models.

For DP solutions, there exists a trade-off between privacy and usability, determined by the noise scale added during training. Adding more noise strengthens the privacy guarantee but decreases the model accuracy. As a result, it is critical to identify the minimal amount of noise that provides the desired privacy protection while maintaining acceptable model performance. In this paper, we propose a novel DP-SGD algorithm for decentralized learning systems, which efficiently reduces the noise scale required for the DP guarantee compared to prior works.

I-A Related Works

DP techniques for deep learning.  Existing DP-SGD algorithms adopt additive noise mechanisms, adding random noise to the gradients during the training process. To improve model usability while guaranteeing DP, these algorithms usually restrict the sensitivity of the randomized mechanisms. Abadi et al. [1] bounded the influence of training samples on gradients by clipping each gradient in $\ell_2$ norm below a given threshold. Yu et al. [26] improved model accuracy by decaying the noise added to the gradients over the training time, since the learned models converge iteratively.
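To make this clip-then-noise recipe concrete, the following is a minimal NumPy sketch of one additive-noise SGD step in the style of [1]; the function name and the per-sample-gradient interface are our own illustrative choices, not taken from any cited implementation.

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, lr, clip_C, sigma, rng):
    """One DP-SGD step: clip each per-sample gradient in L2 norm to
    clip_C, average the clipped gradients, then add Gaussian noise
    calibrated to the clipping bound before the descent update."""
    clipped = [g / max(1.0, np.linalg.norm(g) / clip_C)
               for g in per_sample_grads]
    avg_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip_C / len(clipped),
                       size=avg_grad.shape)
    return params - lr * (avg_grad + noise)
```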

Another way to improve model usability lies in precisely tracking the overall privacy cost of the training process. Shokri et al. [24] and Wei et al. [25] composed the additive noise mechanisms using the advanced composition theorem [6], leading to a linear increase in the privacy budget. In [1, 2, 11, 14], the moments accountant (MA) was used to reduce the added noise by keeping track of a bound on the moments of the privacy loss during the training process. Other algorithms [23, 13, 26] improve model usability using (zero-)concentrated DP [7], based on the observation that the privacy loss of an additive noise mechanism follows a sub-Gaussian distribution.

Applying DP to distributed learning systems.  For centralized learning systems, some works [24, 2, 11, 14] applied the above techniques to preserve the privacy of each participant's training data. For decentralized learning systems, several DP algorithms [27, 28, 16, 4] were also proposed. However, those decentralized solutions are restricted to the Alternating Direction Method of Multipliers (ADMM) algorithm, and cannot be used with the mainstream SGD. The only state-of-the-art solution for DP-SGD in decentralized systems is [16], which optimized the DP mechanism with an advanced composition theorem for tracking the accumulated privacy loss. It serves as the baseline for comparison with our solution.

I-B Contributions

We design a new DP-SGD solution for decentralized learning with the following contributions.

Higher usability.  Existing works all focus on restricting the sensitivity of the SGD algorithm to improve model usability, which seems to have reached its performance limit. In contrast, we propose a novel topology-aware technique, which leverages the network features of decentralized systems to optimize the randomized mechanism. This effectively reduces the noise scale and improves model usability. In addition, we adapt the noise decay technique from the single-party training mode to the decentralized setting, to further optimize the DP protection.

Broader applicability.  Due to discrepancies in network bandwidth or unpredictable system faults, asynchronous decentralized systems are pervasive [18, 19]. Unfortunately, existing DP algorithms usually assume global synchronization, which makes them vulnerable to fluctuations in decentralized systems: for instance, the training task has to be suspended or slowed down under a poor connection among participants.

Our solution considers both synchronous and asynchronous decentralized learning. We introduce a novel learning protocol, in which each agent calculates and sends different aggregated estimates to different neighbors. This protocol maximizes the noise reduction from the topology-aware technique, and it naturally adapts to the asynchronous learning mode.

Formal privacy analysis and comprehensive experimental evaluation.  We formally prove that our solution guarantees DP for all participants, and demonstrate the benefits brought by our optimization techniques. We conduct extensive experiments to show the superiority of our method over prior works under various system settings.

II Preliminaries

II-A Decentralized Systems

We consider a decentralized system whose communication topology can be represented as an undirected graph $G = (V, E)$. $V$ denotes the set of participants (or agents) in this decentralized network. $E \subseteq V \times V$ represents the set of communication links among the agents, with the following two properties: (1) $(i, j) \in E$ if and only if agent $i$ can receive information from agent $j$; (2) $(i, j) \in E$ if $(j, i) \in E$.

In this decentralized learning system, the agents cooperatively train a model by optimizing the loss function with SGD and exchanging estimates with their neighbors. Let $x \in \mathbb{R}^d$ be the $d$-dimensional estimate vector of a DL model, and $F(x; \xi)$ be the loss function. Each agent $i$ holds a private training dataset $D_i$, consisting of independent and identically distributed (i.i.d.) data samples from a distribution $\mathcal{D}$. The agents train a shared model by solving the optimization problem $\min_{x \in \mathbb{R}^d} \mathbb{E}_{\xi \sim \mathcal{D}} F(x; \xi)$, where $\xi$ is a training data sample from $\mathcal{D}$.

During the training process, agent $i$ updates its local estimate $x_i^t$ iteratively and sends it to its neighbors $\mathcal{N}_i$. In the synchronous mode, agent $i$ needs to receive all the estimates from its neighbors before updating its local estimate. In the asynchronous mode, some neighbors may be unable to communicate with agent $i$ at certain iterations due to low bandwidth or a system crash, so agent $i$ can only collect the estimates from part of its neighbors. To adapt to both synchronous and asynchronous modes while maintaining the convergence rate, agent $i$ (1) first asks each neighbor whether it participates at this iteration; (2) randomly selects a neighbor $j$ from the responding neighbors $\mathcal{R}_i^t \subseteq \mathcal{N}_i$; and (3) uses the following update rule [18, 8] to calculate the local estimate:

$$x_i^{t+1} = \lambda x_i^t + (1 - \lambda) x_j^t - \gamma \nabla F(x_i^t; \xi_i^t) \qquad (1)$$

where $\lambda$ is a hyper-parameter determining the weight of the local estimate, $\gamma$ is the learning rate, and $\nabla F(x_i^t; \xi_i^t)$ is the stochastic gradient with $\xi_i^t \sim \mathcal{D}$. The gradient can also be replaced by a mini-batch of stochastic gradients [17, 18].

II-B Differential Privacy

DP is a rigorous mathematical framework to protect the privacy of individual records in a database when the aggregated information about this database is shared among untrusted parties [5]. In a decentralized system, we assume all agents are honest-but-curious. Our goal is to adopt DP to protect the training data privacy of each agent. Decentralized learning with DP is formally defined below:

Definition 1.

(Decentralized Learning with DP) For each agent $i$, a randomized mechanism $\mathcal{M}_i$ with domain $\mathcal{X}$ and range $\mathcal{R}$ satisfies $(\epsilon_i, \delta_i)$-DP if for any two adjacent datasets $D_i, D_i' \subseteq \mathcal{X}$ and any subset of outputs $S \subseteq \mathcal{R}$, the following property holds:

$$\Pr[\mathcal{M}_i(D_i) \in S] \le e^{\epsilon_i} \Pr[\mathcal{M}_i(D_i') \in S] + \delta_i \qquad (2)$$

$\mathcal{M}_i$ is restricted by two parameters: $\epsilon_i$ and $\delta_i$. $\epsilon_i$ is the privacy budget of agent $i$, which limits the privacy loss of its training data. $\delta_i$ is a relaxation parameter that allows the privacy loss of $\mathcal{M}_i$ to exceed $\epsilon_i$ with probability $\delta_i$. A decentralized learning system is differentially private if $\mathcal{M}_i$ satisfies $(\epsilon_i, \delta_i)$-DP for every agent $i \in V$. Each agent can set its own privacy budget; alternatively, the entire system can enforce a uniform privacy budget for all agents.

To achieve differentially private decentralized learning, a common and straightforward way is to use additive noise mechanisms at each iteration [12]. Specifically, we use the Gaussian mechanism and denote by $\sigma_i$ the noise parameter of agent $i$. At each iteration, agent $i$ adds Gaussian noise $\mathcal{N}(0, \sigma_i^2 \mathbf{I}_d)$ to the updated local estimate to guarantee differential privacy:

$$x_i^{t+1} = \lambda x_i^t + (1 - \lambda) x_j^t - \gamma \nabla F(x_i^t; \xi_i^t) + \mathcal{N}(0, \sigma_i^2 \mathbf{I}_d) \qquad (3)$$

Then, agent $i$ sends $x_i^{t+1}$ to its neighbors.
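For concreteness, here is a minimal sketch of the noisy update; setting sigma to zero recovers the plain rule of Eq. 1. The names are ours, and in practice the noise parameter would come from the calibration discussed in Section IV.

```python
import numpy as np

def dp_decentralized_update(x_i, x_j, grad, lam, lr, sigma, rng):
    """Eqs. (1) and (3): mix the local estimate x_i with a selected
    neighbor's estimate x_j, take a gradient step, and (for sigma > 0)
    add Gaussian noise so the shared estimate satisfies DP."""
    x_new = lam * x_i + (1.0 - lam) * x_j - lr * grad
    return x_new + rng.normal(0.0, sigma, size=x_new.shape)
```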

III An Optimized DP-SGD Algorithm

As shown in Eq. 3, the random noise added to the aggregated estimate must be large enough to satisfy the privacy requirement; however, adding too much noise hurts the model accuracy, so it is important to balance this trade-off. We propose a novel DP-SGD algorithm that reduces the amount of noise each agent must add, improving the usability of the trained models without violating the DP requirement. Our algorithm is general-purpose and applies to both synchronous and asynchronous modes. It consists of two strategies and one learning protocol, described below.

III-A Strategy 1: Topology-aware Noise Reduction

Existing DP-SGD algorithms all assume that the required noise scale depends only on the agents themselves. In decentralized systems, however, the communication topology can affect the amount of noise as well. Our strategy reduces the noise scale of each agent by considering its connectivity with its neighbors. The key insight is that the estimates received from neighbors already contain noise, which can contribute to the noise scale of the aggregated estimate and thus reduce the amount of noise the agent itself must add.

Fig. 1: An illustrative example of topology-aware noise reduction.

Fig. 1 gives an illustrative example. We consider an agent $v_0$ with four neighbors $v_1, \ldots, v_4$, where $v_1$ and $v_2$ are connected as well. When $v_0$ obtains all the estimates of its neighbors, we assume it picks the estimate of $v_1$ for aggregation with its own estimate and gradient. Since the received $x_1$ already includes Gaussian noise $\mathcal{N}(0, \sigma_1^2 \mathbf{I}_d)$, the aggregated estimate following Eq. 3 contains the corresponding random component $(1 - \lambda)\mathcal{N}(0, \sigma_1^2 \mathbf{I}_d)$. As a result, when generating the estimate for agent $v_3$ or $v_4$, $v_0$ does not need to add the full-scale noise $\mathcal{N}(0, \sigma_0^2 \mathbf{I}_d)$. It only needs to inject noise $\mathcal{N}(0, \hat{\sigma}_0^2 \mathbf{I}_d)$ such that $(1 - \lambda)^2 \sigma_1^2 + \hat{\sigma}_0^2 \ge \sigma_0^2$, which meets the DP requirement while reducing the actual amount of added noise.

It is worth noting that the reduced noise scale is not applicable when generating estimates for $v_1$ or $v_2$. For $v_1$, since it already knows its own parameter $x_1$, the component $(1 - \lambda) x_1$ is not random noise anymore; it is similar for $v_2$, as it receives $x_1$ from $v_1$. For these two agents, we can pick another agent (e.g., $v_3$ or $v_4$) for aggregation and generate a different estimate for them, still with a reduced noise scale.

Formally, given an agent $v_0$ with neighbor set $\mathcal{N}_0$, for each of its neighbors $j \in \mathcal{N}_0$ we define $C_j = \{u \in \mathcal{N}_0 : u \ne j, (j, u) \notin E\}$, i.e., the set of $v_0$'s neighbors that are not connected to $j$ (nor $j$ itself). For instance, in Fig. 1, we have $C_{v_1} = C_{v_2} = \{v_3, v_4\}$, $C_{v_3} = \{v_1, v_2, v_4\}$, and $C_{v_4} = \{v_1, v_2, v_3\}$. This also means that $x_j$ can be used in the aggregation for all agents in $C_j$ with the reduced noise scale. Our goal is then to find a minimal set $S \subseteq \mathcal{N}_0$, such that using the agents inside $S$ for aggregation covers all the neighbors of $v_0$. Note that there can exist a neighbor that is connected to every other neighbor of $v_0$; we cannot find a non-adjacent neighbor to cover it, and should exclude it from the covering target $\widehat{\mathcal{N}}_0 = \bigcup_{j \in \mathcal{N}_0} C_j$. This process is described in Eq. 4. We solve it approximately in Section III-C.

$$S^* = \operatorname*{arg\,min}_{S \subseteq \mathcal{N}_0} |S| \quad \text{s.t.} \quad \bigcup_{j \in S} C_j = \widehat{\mathcal{N}}_0 \qquad (4)$$
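Eq. 4 is an instance of minimum set cover, which is NP-hard in general; the standard greedy heuristic below is one plausible way to approximate it, shown on the Fig. 1 topology. The function and variable names are ours, and Section III-C describes the approximation actually used by the protocol.

```python
def greedy_cover(neighbors, edges):
    """Greedily approximate Eq. (4): pick neighbors whose non-adjacent
    sets C_j jointly cover every coverable neighbor of v0.
    `edges` holds the links among v0's neighbors as frozensets."""
    C = {j: {u for u in neighbors
             if u != j and frozenset((j, u)) not in edges}
         for j in neighbors}
    coverable = set().union(*C.values()) if C else set()
    cover, covered = [], set()
    while covered != coverable:
        j = max(C, key=lambda k: len(C[k] - covered))  # largest gain
        if not C[j] - covered:
            break
        cover.append(j)
        covered |= C[j]
    return cover

# Fig. 1: neighbors v1..v4 of v0, with v1 and v2 also linked.
print(greedy_cover({"v1", "v2", "v3", "v4"}, {frozenset(("v1", "v2"))}))
# prints a cover of size two (which two depends on tie-breaking)
```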

After identifying $S^*$, consider each neighbor $u \in \mathcal{N}_0$. If $u$ is connected to every other neighbor in $\mathcal{N}_0$ (i.e., $u \notin \widehat{\mathcal{N}}_0$), then agent $v_0$ simply sends the local estimate with full-scale noise to $u$. Otherwise, there exists at least one neighbor $j \in S^*$ such that $u \in C_j$. The noise scale $\hat{\sigma}_0$ from $v_0$ to $u$ should then satisfy Eq. 5(a) in order to guarantee the DP requirement against $u$, where $\sigma_0$ and $\sigma_j$ are the full-scale noise parameters. According to the additivity of the Gaussian distribution, we calculate the noise parameter via Eq. 5(b). With this reduced noise scale, agent $v_0$ updates the estimate for agent $u$ based on Eq. 5(c).

$$(1 - \lambda)^2 \sigma_j^2 + \hat{\sigma}_0^2 \ge \sigma_0^2 \qquad (5a)$$
$$\hat{\sigma}_0 = \sqrt{\sigma_0^2 - (1 - \lambda)^2 \sigma_j^2} \qquad (5b)$$
$$\hat{x}_{0 \to u}^{t+1} = \lambda x_0^t + (1 - \lambda) x_j^t - \gamma \nabla F(x_0^t; \xi_0^t) + \mathcal{N}(0, \hat{\sigma}_0^2 \mathbf{I}_d) \qquad (5c)$$
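A small helper mirroring Eq. 5(b); the max-with-zero guard is our own defensive addition for the corner case where the carried noise already exceeds the full-scale target.

```python
import math

def reduced_sigma(sigma_0, sigma_j, lam):
    """Eq. (5b): the noise v0 must still add when the selected neighbor
    j's estimate already carries the component (1-lam) * N(0, sigma_j^2)."""
    carried = (1.0 - lam) ** 2 * sigma_j ** 2
    return math.sqrt(max(sigma_0 ** 2 - carried, 0.0))

# e.g. full scale 4.0, equal neighbor noise, lam = 0.25:
print(reduced_sigma(4.0, 4.0, 0.25))   # ~2.65 instead of 4.0
```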

III-B Strategy 2: Time-aware Noise Decay

This technique was originally proposed in [26] to optimize the DP protection of model training in single-party systems; here we apply it to decentralized systems. The key idea is that as training proceeds, the model converges and the norm of the gradients decreases. Thus, the sensitivity of the Gaussian mechanism decreases, allowing us to inject less noise into the gradients. Note that since the training datasets are distributed across different agents, all agents in the decentralized system should reach a consensus on the noise decay schedule to tolerate the differences in their datasets.

Specifically, compared to the aggregation process in Eq. 5(c), our first modification is to clip the gradients in $\ell_2$ norm to bound their size at each training iteration. We follow the method from [1]: given a clipping threshold $C$, the clipped gradient vector $\bar{g}$ is bounded by $\|\bar{g}\|_2 \le C$, as shown in Eq. 6(a).

Our second modification is to dynamically reduce the noise scale over the training time. Without loss of generality, we use step decay to reduce the noise scale every few epochs. Let $\sigma_i^0$ be the initial noise parameter of agent $i$. The noise parameter of agent $i$ at the $t$-th iteration is given in Eq. 6(b), where $k$ is the reduction factor and $R$ is the reduction step of the noise decay.

$$\bar{g} = g / \max\left(1, \|g\|_2 / C\right) \qquad (6a)$$
$$\sigma_i^t = \sigma_i^0 \cdot k^{\lfloor t / R \rfloor} \qquad (6b)$$
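Both modifications are a few lines each; this sketch follows Eq. 6 directly, with the clipping rule reconstructed from [1] as noted above.

```python
import numpy as np

def clip_gradient(g, C):
    """Eq. (6a): rescale g so that its L2 norm is at most C."""
    return g / max(1.0, np.linalg.norm(g) / C)

def decayed_sigma(sigma_0, k, R, t):
    """Eq. (6b): step decay -- shrink the initial noise parameter by a
    factor k every R iterations (k, R agreed on by all agents)."""
    return sigma_0 * k ** (t // R)

# e.g. with k = 0.9 and R = 1000 (the values used in Section V):
print(decayed_sigma(4.0, 0.9, 1000, 2500))   # 4.0 * 0.9**2 = 3.24
```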

III-C Learning Protocol

With the two novel strategies in place, we now describe the end-to-end protocol, shown in Algorithm 1.

Synchronous training.  The algorithm takes as input the initial estimate $x_0^0$, the initial noise parameter $\sigma_0^0$, the learning rate $\gamma$, and the number of iterations $T$. Before training, agent $v_0$ sends its neighbor list and noise parameter to, and receives those of, its neighbors. For each neighbor $j \in \mathcal{N}_0$, agent $v_0$ computes the set $C_j$. Then, agent $v_0$ perturbs the initial estimate with full-scale noise and sends it to its neighbors. At the $t$-th iteration, agent $v_0$ first computes the full-scale noise parameter $\sigma_0^t$ using the time-aware noise decay strategy. Afterwards, agent $v_0$ generates estimates for its neighbors and updates its local estimate using the topology-aware noise reduction strategy.

To approximately solve Eq. 4, agent $v_0$ repeatedly selects the neighbor covering the most uncovered agents from $\mathcal{N}_0$, until all coverable neighbors are covered or all candidates have been traversed. For each selected $j$, agent $v_0$ computes the estimate $\hat{x}_{0 \to u}^{t+1}$ via Eq. 5 and sends it to the agents $u \in C_j$. The complexity of this approximation is polynomial in the number of neighbors. Then, agent $v_0$ randomly selects a neighbor and updates its local noised estimate via Eq. 3. If there are still uncovered neighbors, $v_0$ sends its local estimate (with full-scale noise) to them. After $T$ iterations, Algorithm 1 returns the final differentially private deep model.

Asynchronous training.  Compared to the synchronous setting, some neighbors may not participate in certain iterations in the asynchronous setting. To tolerate such situations, right before each iteration, agent $v_0$ first asks its neighbors whether they will participate in this iteration and obtains the set of responding neighbors. It then continues with the same procedure as synchronous training, using only the responding neighbors in this iteration.

Algorithm 1: The proposed differentially private decentralized learning protocol (synchronous and asynchronous modes).
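Since the listing itself did not survive extraction, the following self-contained toy sketch reconstructs one agent's iteration from the prose above, on the Fig. 1 topology. The gradients and received estimates are simulated stand-ins, all names are ours, and for simplicity every agent is assumed to use the same (equal) noise parameter, as in Section IV-B.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, lr, C, sigma0, k, R = 10, 0.25, 0.05, 4.0, 4.0, 0.9, 1000

# Fig. 1 from v0's point of view: four neighbors, v1 and v2 linked.
nbrs = ["v1", "v2", "v3", "v4"]
linked = {frozenset(("v1", "v2"))}
C_set = {j: [u for u in nbrs if u != j and frozenset((j, u)) not in linked]
         for j in nbrs}

x = rng.normal(size=d)                            # v0's local estimate
for t in range(3):
    sigma_t = sigma0 * k ** (t // R)              # Eq. (6b): noise decay
    g = rng.normal(size=d)                        # stand-in gradient
    g = g / max(1.0, np.linalg.norm(g) / C)       # Eq. (6a): clipping
    recv = {j: rng.normal(size=d) for j in nbrs}  # simulated noisy estimates
    # Reduced-noise estimate built on v1, sent to C_set["v1"] = [v3, v4],
    # since neither of them can cancel out v1's noise.
    s_hat = np.sqrt(max(sigma_t**2 - (1 - lam)**2 * sigma_t**2, 0.0))
    x_hat = (lam * x + (1 - lam) * recv["v1"] - lr * g
             + rng.normal(0.0, s_hat, size=d))    # Eq. (5c)
    # Local update with one randomly selected neighbor, full-scale noise.
    j_star = rng.choice(nbrs)
    x = (lam * x + (1 - lam) * recv[j_star] - lr * g
         + rng.normal(0.0, sigma_t, size=d))      # Eq. (3)
```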

IV Privacy Analysis

We perform a formal analysis of Algorithm 1 from the aspects of privacy and efficiency.

IV-A Proof of DP

First, we prove that Algorithm 1 is differentially private when the initial noise parameters are chosen carefully. We track the accumulated privacy loss of the training process using a modern DP technique, Rényi DP [21], which ensures a sublinear growth of the privacy loss as a function of the number of iterations.

Theorem 1.

Let $T$ be the number of iterations, $C$ the clipping threshold, and $\Delta = 2\gamma C$ the resulting sensitivity of the update rule. For any decentralized system and every agent $i$, the randomized mechanism in Algorithm 1 is $(\epsilon_i, \delta_i)$-DP if, for some $\alpha > 1 + \ln(1/\delta_i)/\epsilon_i$, we choose

$$\sigma_i^0 = \Delta \sqrt{\frac{T\alpha}{2\big(\epsilon_i - \ln(1/\delta_i)/(\alpha - 1)\big)}} \qquad (7)$$
Proof.

We prove the theorem in the synchronous mode and ignore the time-aware noise decay strategy, since it does not incur any additional privacy loss [26]. We clip the gradients in $\ell_2$ norm with threshold $C$ and assume the privacy budget is the same at each iteration; since adjacent datasets differ in at most one sample, the sensitivity of the update rule is bounded by $\Delta = 2\gamma C$. According to the Gaussian mechanism [5], the update rule with noise parameter $\sigma_i$ is differentially private at one iteration; in the Rényi DP view [21], it satisfies $(\alpha, \alpha\Delta^2/(2\sigma_i^2))$-RDP. Using the Rényi composition theorem [21], the update rule satisfies $(\alpha, T\alpha\Delta^2/(2\sigma_i^2))$-RDP after $T$ iterations, which implies $\big(T\alpha\Delta^2/(2\sigma_i^2) + \ln(1/\delta_i)/(\alpha - 1), \delta_i\big)$-DP. Combining the above, we conclude that our update rule is $(\epsilon_i, \delta_i)$-DP if we choose $\sigma_i$ such that:

$$\sigma_i^2 \ge \frac{T\alpha\Delta^2}{2\big(\epsilon_i - \ln(1/\delta_i)/(\alpha - 1)\big)} \qquad (8)$$

We have proven that the local estimate of agent $v_0$ is differentially private during the training process. Next, we prove that for each neighbor $u \in \widehat{\mathcal{N}}_0$, the estimate generated for $u$ is also differentially private. Let $j$ be the agent selected for generating the estimate for $u$. Since $j$ and $u$ are not directly connected, the noise contained in $x_j$ can serve as a random component that helps guarantee the DP of the estimate against $u$. Because all agents generate their noise independently, the noise scale for $u$ should satisfy

$$(1 - \lambda)^2 \sigma_j^2 + \hat{\sigma}_0^2 \ge \sigma_0^2 \qquad (9)$$

According to the additivity of the Gaussian distribution, the noise parameter for the estimate sent to $u$ is $\hat{\sigma}_0 = \sqrt{\sigma_0^2 - (1 - \lambda)^2 \sigma_j^2}$. Therefore, in Algorithm 1, the estimates generated for $v_0$'s neighbors are also differentially private. ∎
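Under the Rényi-DP accounting reconstructed in the proof above, the calibration in Eq. 7 amounts to a small search over the order alpha. The sketch below is our illustration of that computation, not code from the paper.

```python
import math

def sigma_for_budget(eps, delta, T, sensitivity,
                     alphas=(1.5, 2, 4, 8, 16, 32, 64)):
    """Calibrate Gaussian noise so T-fold composition is (eps, delta)-DP
    under Renyi-DP accounting [21]: each step is
    (alpha, alpha * Delta^2 / (2 sigma^2))-RDP, composition sums the RDP
    terms, and conversion to DP adds log(1/delta) / (alpha - 1)."""
    best = float("inf")
    for a in alphas:
        slack = eps - math.log(1.0 / delta) / (a - 1.0)
        if slack <= 0:           # this alpha cannot meet the budget
            continue
        best = min(best, sensitivity * math.sqrt(T * a / (2.0 * slack)))
    return best

# e.g. eps = 1.0, delta = 1e-5, T = 10000, sensitivity 2 * lr * C:
print(sigma_for_budget(1.0, 1e-5, 10000, 2 * 0.05 * 4.0))
```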

IV-B Efficiency Analysis

We further analyze the noise reduction introduced by considering the communication topology. Without loss of generality, we assume $\sigma_i = \sigma$ for every agent $i$ at a given iteration. Let $\hat{\sigma}$ be the reduced noise parameter of the estimates generated with the topology-aware noise reduction strategy. According to Eq. 5(b),

$$\hat{\sigma} = \sqrt{\sigma^2 - (1 - \lambda)^2 \sigma^2} = \sqrt{2\lambda - \lambda^2} \, \sigma \qquad (10)$$

Thus, compared with the full-scale noise parameter, the noise added to these estimates is reduced by a factor of $\sqrt{2\lambda - \lambda^2}$. We observe that $\hat{\sigma}$ decreases as $\lambda$ decreases. When $\lambda$ approaches 0, the noise of the estimates that an agent sends to (and receives from) its neighbors is significantly reduced.
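To see the size of the Eq. 10 factor at the aggregation weights used in the experiments, a two-line check suffices:

```python
import math

for lam in (0.125, 0.25, 0.5, 0.75):
    print(f"lambda = {lam:5.3f}: noise scaled by {math.sqrt(2*lam - lam**2):.3f}")
```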

V Experiments

V-A Implementation and Experimental Setup

Dataset and DNN model.  We conduct the experiments on the MNIST dataset, which consists of a training set of 60k samples and a test set of 10k samples. We consider a fully connected network with one hidden layer of size 100 for image classification, and set a decaying learning rate with an initial value of 0.05. Our algorithm is general and can be applied to other DNN tasks as well; results on CIFAR10 can be found in the supplementary material.

For the implementation of the decentralized system, we consider a network consisting of 30 agents, where each agent connects to every other with probability 0.2 (the connection rate). The decentralized system is guaranteed to be connected, i.e., there exists at least one path between any two agents. The training set of each agent is independent and identically distributed and of the same size. In the synchronous mode, all 30 agents participate in each training iteration; in the asynchronous mode, we assume 10% of the agents, chosen at random, do not participate in each iteration.

Without loss of generality, the agents have the same privacy budget ($\epsilon = 1.0$) and relaxation parameter $\delta$. We assume the agents reach consensus on the time-aware noise decay strategy, where the reduction factor $k$ and reduction step $R$ are 0.9 and 1000, respectively. We clip the gradients in $\ell_2$ norm with a threshold of 4.0.

Baselines and metrics.  We consider different decentralized learning algorithms in our experiments:

  • No Noise: the agents exchange parameters without DP protection.

  • Li18: the DP-SGD algorithm proposed by Li et al. [16].

  • Li18+MA: we integrate Li18 with the moments accountant [1] to track the accumulated privacy loss.

  • Proposed: our proposed algorithm (Algorithm 1).

It is worth noting that the first three solutions cannot be applied to the asynchronous mode directly. For fair comparison, we modify their update rules as in Eq. 3 to follow our learning protocol for asynchronous learning. For each algorithm, we measure the test accuracy of each agent's model at every iteration during training, and report the average accuracy.

V-B Effectiveness of Proposed

We evaluate and compare the performance of those DP-SGD algorithms under different settings in both synchronous and asynchronous modes.

Epoch vs. accuracy.  Figs. 2 and 3 illustrate the trend of the average test accuracy during training with different $\lambda$ values. First, we observe that our proposed algorithm outperforms Li18 and Li18+MA, and is closer to the No Noise case, across $\lambda$ values and modes. The advantage is more pronounced with a smaller $\lambda$, as the amount of reduced noise is larger. Second, Li18+MA performs better than Li18 because of the usage of MA. Unlike our solution, the usability of the models from Li18 and Li18+MA decreases as $\lambda$ decreases; this is caused by the increased noise of the selected estimates. Third, training in the synchronous mode converges slightly faster than in the asynchronous mode, since every participant can contribute to the model training and accelerate the process.

Privacy budget vs. accuracy.  We consider the impact of the privacy budget on the model accuracy, as shown in Fig. 4. Our solution beats the other two DP solutions for all tested privacy budgets. Besides, when the privacy budget decreases, the model usability decreases, as more noise must be injected into the estimates. Meanwhile, the advantage of our solution grows, as the amount of reduced noise increases as well. This indicates that our algorithm is more effective when a small privacy budget is required.

Fig. 2: The average accuracy of the agents with different $\lambda$ values under synchronous settings.
Fig. 3: The average accuracy of the agents with different $\lambda$ values under asynchronous settings.
Fig. 4: The average accuracy of the agents as the privacy budget increases: (a) synchronous; (b) asynchronous.

V-C Effectiveness of Topology-aware Noise Reduction

Our DP-SGD algorithm is composed of two strategies: topology-aware noise reduction (NR) and time-aware noise decay (ND). We evaluate the integration of these two strategies in the main paper. In this section, we measure the effectiveness of NR alone. Figures 5 and 6 illustrate the performance comparison between NR and the other DP-SGD algorithms.

We observe that NR outperforms Li18+MA in both synchronous and asynchronous modes, and the advantage is more significant when $\lambda$ is smaller. In the synchronous mode, NR has almost the same performance as NR+ND in the first 20 epochs, as the noise reduced by the ND strategy is quite small in the first two reduction steps (the noise is not reduced at all in the first step). With more epochs, NR+ND becomes slightly better than NR alone, thanks to the effect of ND. In the asynchronous mode, NR has almost the same performance as NR+ND, especially when $\lambda$ equals 0.25.

V-D Impact of Connection Rate

We set the connection rate of the decentralized network to 0.2 in the main paper. Our proposed algorithm remains effective with other connection rates as well. In this section, we measure and compare the performance of the DP-SGD algorithms with connection rates of 0.1 and 0.4. Without loss of generality, we consider the synchronous mode and set $\lambda$ to 0.25. Figure 7 shows the average accuracy of the agents with these two connection rates as the training epoch increases. We observe that the performance of each algorithm barely changes with the connection rate. The underlying reason may be that although the number of an agent's neighbors changes with the connection rate, the agent still selects one estimate for the update at each iteration, so the training result does not change either. As such, our proposed solution retains its advantage over prior works under various network connection rates.

Fig. 5: The effectiveness of topology-aware noise reduction with different $\lambda$ values under synchronous settings.
Fig. 6: The effectiveness of topology-aware noise reduction with different $\lambda$ values under asynchronous settings.

V-E Impact of Aggregation Weight

We evaluate the performance of these algorithms with a larger (0.75) and a smaller (0.125) $\lambda$ value in the synchronous mode. The average accuracy of the agents is shown in Figure 8. The experimental results lead to the same conclusion as Section V-B: a smaller $\lambda$ value yields a larger improvement from our proposed solution.

Fig. 7: The average accuracy of the agents with different connection rates ((a) 0.1; (b) 0.4) under synchronous settings.
Fig. 8: The average accuracy of the agents with different $\lambda$ values under synchronous settings.
Fig. 9: The average accuracy of the agents in different modes ((a) synchronous; (b) asynchronous) on CIFAR10.

V-F Results on CIFAR10

We also evaluate the DP-SGD algorithms on a more complicated training task over the CIFAR10 dataset. The model to be trained is a Convolutional Neural Network consisting of two max-pooling layers and three fully connected layers. The system settings and configurations are the same as those on MNIST. We set $\lambda$ and the connection rate to 0.25 and 0.2, respectively.

Figure 9 illustrates the experimental results in the synchronous and asynchronous modes. We observe that our solution (Proposed) outperforms the prior DP-SGD algorithms and approaches the baseline (No Noise) as the training epoch increases, in both modes. Li18 and Li18+MA do not even converge in the presence of Gaussian noise. The reason is that every parameter of the model needs to be perturbed with random noise to satisfy the DP requirement; when the model becomes more complicated with more parameters, the overall divergence between the original model and the DP-protected model grows larger, making convergence difficult. This does not happen in our solution, thanks to the reduced noise scale.

VI Conclusion

In this paper, we proposed a novel DP-SGD algorithm for decentralized learning systems. We introduced a topology-aware noise reduction technique, which leverages the network topology to reduce the noise scale and improve model usability while still satisfying the DP requirement. We applied the time-aware noise decay technique to decentralized systems to further optimize the model performance. We also designed a learning protocol, which enables the topology-aware technique and adapts to both the synchronous and asynchronous learning modes. To the best of our knowledge, this is the first study to utilize network topology to optimize a DP algorithm and to deploy DP protection in asynchronous decentralized systems. Formal analysis proves that our method guarantees the privacy requirement, and empirical evaluations indicate that our solution achieves a better trade-off between privacy and usability under different system configurations.

References

  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
  • [2] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers (2018) Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984.
  • [3] K. S. Chahal, M. S. Grover, K. Dey, and R. R. Shah (2020) A hitchhiker's guide on distributed training of deep neural networks. Journal of Parallel and Distributed Computing 137, pp. 65–76.
  • [4] J. Ding, Y. Gong, C. Zhang, M. Pan, and Z. Han (2019) Optimal differentially private ADMM for distributed machine learning. arXiv preprint arXiv:1901.02094.
  • [5] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor (2006) Our data, ourselves: privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486–503.
  • [6] C. Dwork, G. N. Rothblum, and S. Vadhan (2010) Boosting and differential privacy. In IEEE Annual Symposium on Foundations of Computer Science, pp. 51–60.
  • [7] C. Dwork and G. N. Rothblum (2016) Concentrated differential privacy. arXiv preprint arXiv:1603.01887.
  • [8] S. Guo, T. Zhang, X. Xie, L. Ma, T. Xiang, and Y. Liu (2020) Towards Byzantine-resilient learning in decentralized systems. arXiv preprint arXiv:2002.08569.
  • [9] Z. He, T. Zhang, and R. B. Lee (2019) Model inversion attacks against collaborative inference. In Annual Computer Security Applications Conference, pp. 148–162.
  • [10] B. Hitaj, G. Ateniese, and F. Perez-Cruz (2017) Deep models under the GAN: information leakage from collaborative deep learning. In ACM SIGSAC Conference on Computer and Communications Security, pp. 603–618.
  • [11] N. Hynes, R. Cheng, and D. Song (2018) Efficient deep learning on multi-source private data. arXiv preprint arXiv:1807.06689.
  • [12] B. Jayaraman and D. Evans (2019) Evaluating differentially private machine learning in practice. In USENIX Security Symposium, pp. 1895–1912.
  • [13] B. Jayaraman, L. Wang, D. Evans, and Q. Gu (2018) Distributed learning without distress: privacy-preserving empirical risk minimization. In Advances in Neural Information Processing Systems, pp. 6343–6354.
  • [14] Y. Kang, Y. Liu, and W. Wang (2019) Weighted distributed differential privacy ERM: convex and non-convex. arXiv preprint arXiv:1910.10308.
  • [15] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
  • [16] C. Li, P. Zhou, L. Xiong, Q. Wang, and T. Wang (2018) Differentially private distributed online learning. IEEE Transactions on Knowledge and Data Engineering 30 (8), pp. 1440–1453.
  • [17] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340.
  • [18] X. Lian, W. Zhang, C. Zhang, and J. Liu (2018) Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, pp. 3043–3052.
  • [19] Q. Luo, J. He, Y. Zhuo, and X. Qian (2019) Heterogeneity-aware asynchronous decentralized training. arXiv preprint arXiv:1909.08029.
  • [20] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov (2019) Exploiting unintended feature leakage in collaborative learning. In IEEE Symposium on Security and Privacy, pp. 691–706.
  • [21] I. Mironov (2017) Rényi differential privacy. In IEEE Computer Security Foundations Symposium, pp. 263–275.
  • [22] A. Nedic and A. Ozdaglar (2009) Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54 (1), pp. 48–61.
  • [23] M. Park, J. Foulds, K. Choudhary, and M. Welling (2017) DP-EM: differentially private expectation maximization. In Artificial Intelligence and Statistics, pp. 896–904.
  • [24] R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. In ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321.
  • [25] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V. Poor (2020) Federated learning with differential privacy: algorithms and performance analysis. IEEE Transactions on Information Forensics and Security.
  • [26] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex (2019) Differentially private model publishing for deep learning. In IEEE Symposium on Security and Privacy, pp. 332–349.
  • [27] T. Zhang and Q. Zhu (2016) Dynamic differential privacy for ADMM-based distributed classification learning. IEEE Transactions on Information Forensics and Security 12 (1), pp. 172–187.
  • [28] X. Zhang, M. M. Khalili, and M. Liu (2018) Improving the privacy and accuracy of ADMM-based distributed algorithms. In International Conference on Machine Learning.
  • [29] L. Zhu, Z. Liu, and S. Han (2019) Deep leakage from gradients. In Advances in Neural Information Processing Systems, pp. 14747–14756.