Combining Differential Privacy and Byzantine Resilience in Distributed SGD

10/08/2021, by Rachid Guerraoui, et al.

Privacy and Byzantine resilience (BR) are two crucial requirements of modern-day distributed machine learning. The two concepts have been extensively studied individually but the question of how to combine them effectively remains unanswered. This paper contributes to addressing this question by studying the extent to which the distributed SGD algorithm, in the standard parameter-server architecture, can learn an accurate model despite (a) a fraction of the workers being malicious (Byzantine), and (b) the other fraction, whilst being honest, providing noisy information to the server to ensure differential privacy (DP). We first observe that the integration of standard practices in DP and BR is not straightforward. In fact, we show that many existing results on the convergence of distributed SGD under Byzantine faults, especially those relying on (α,f)-Byzantine resilience, are rendered invalid when honest workers enforce DP. To circumvent this shortcoming, we revisit the theory of (α,f)-BR to obtain an approximate convergence guarantee. Our analysis provides key insights on how to improve this guarantee through hyperparameter optimization. Essentially, our theoretical and empirical results show that (1) an imprudent combination of standard approaches to DP and BR might be fruitless, but (2) by carefully re-tuning the learning algorithm, we can obtain reasonable learning accuracy while simultaneously guaranteeing DP and BR.


1 Introduction

Distributed machine learning (ML) has received significant attention in recent years due to the growing complexity of ML models and the increasing computational resources required to train them [11, 33]. One of the most popular distributed ML settings is the parameter server architecture, wherein multiple machines (called workers) jointly learn a single large model on their collective dataset with the help of a trusted server running the stochastic gradient descent (SGD) algorithm [7]. In this scheme, the server maintains an estimate of the model parameters, which is iteratively updated using stochastic gradients computed by the workers.
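For concreteness, the following minimal sketch simulates a few steps of this scheme on a toy least-squares problem; all names, sizes, and the loss are our own illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): d-dimensional least-squares loss on m points,
# shared by n workers in a parameter-server architecture.
d, m, n_workers, batch_size, lr = 10, 1_000, 5, 32, 0.1
X, y = rng.normal(size=(m, d)), rng.normal(size=m)

def minibatch_gradient(theta, idx):
    """Gradient of the squared loss 0.5*||X @ theta - y||^2 averaged over a batch."""
    A, t = X[idx], y[idx]
    return A.T @ (A @ theta - t) / len(idx)

theta = np.zeros(d)  # model estimate maintained by the server
for _ in range(200):
    # Each worker samples a mini-batch and sends its stochastic gradient.
    grads = [minibatch_gradient(theta, rng.choice(m, batch_size, replace=False))
             for _ in range(n_workers)]
    theta -= lr * np.mean(grads, axis=0)  # the server averages and updates
```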

Compared to its centralized counterpart, distributed SGD is more susceptible to security threats. One of them is the violation of data privacy by an honest-but-curious server [43]. Another is malfunctioning due to (what is called) Byzantine behavior of workers [21, 5]. In the past, significant progress has been made in addressing these issues separately. In the former case, (ε,δ)-differential privacy (DP) has become a dominant standard for preserving privacy in ML, especially when considering neural networks [13, 1]. In the latter case, (α,f)-Byzantine resilience has emerged as the principal notion for demonstrating the Byzantine resilience (BR) of distributed SGD [5, 14]. Since DP and BR are two crucial pillars of distributed machine learning, practitioners will inevitably have to build systems satisfying both requirements. It is thus natural to ask: can we simultaneously ensure DP and BR in distributed ML?

In this paper, we take a first step towards a positive answer to this question by studying the resilience of the renowned DP-SGD algorithm [1] against Byzantine workers. More precisely, we consider distributed SGD where, in each learning step, the honest workers inject Gaussian noise into their gradients to ensure (ε,δ)-DP, while the server updates the parameters by applying an (α,f)-BR aggregation rule on the received gradients (to protect against Byzantine workers). Upon analyzing the convergence of this algorithm, we show that DP and BR can indeed be combined, but doing so is non-trivial. Our key contributions are summarized below.

1. Inapplicability of existing results from the BR literature.

We start by highlighting an inherent incompatibility between the supporting theory of (α,f)-BR and the Gaussian mechanism used in DP-SGD. Specifically, we show (in Section 3.2) that the variance-to-norm (VN) condition, critical to guaranteeing (α,f)-BR, cannot be satisfied when honest workers enforce (ε,δ)-DP via Gaussian noise injection. Hence, existing results on the resilience of distributed SGD to Byzantine workers are not applicable when considering DP-SGD. More generally, this highlights limitations of many existing Byzantine resilient techniques in settings where the stochasticity of the gradients is non-trivial.

2. Adapting the theory of BR to account for DP.

To overcome the aforementioned shortcoming, we introduce a relaxation of the VN condition (in Section 3.3), namely the -approximated VN condition. By doing so, we (1) generalize existing results from the BR literature and (2) demonstrate approximate convergence of DP-SGD under Byzantine faults. Our convergence result can be roughly put as follows.

Theorem (Informal).

Let L be the loss function of the learning model, and let θ_T be the parameter vector obtained after T steps of our algorithm. If the -approximated VN condition holds true, then the expected Euclidean norm of ∇L(θ_T) is bounded by a term that vanishes with the number of steps T, plus an additive error governed by the approximation parameter (the formal statement appears as Theorem 2).

As the aforementioned result suggests, a smaller approximation error ensures better convergence. To quantify this convergence guarantee, we present (in Section 3.4) necessary and sufficient conditions for the -approximated VN condition to hold. Specifically, we show that the condition holds only if the injected DP noise is small enough relative to a threshold involving the model size, the batch size, and the dataset size. This showcases an important interplay between DP and BR: e.g., larger privacy parameters lead to stronger resilience to Byzantine workers at the expense of weaker privacy.

3. From theoretical insights to practical convergence.

Figure 1: Impact of the batch size and aggregation rule on the cross-accuracy of DP-SGD against the little [3] attack on Fashion-MNIST.

Importantly, our result (in Section 3.4) provides key insights on how to better integrate standard approaches to DP and BR using hyperparameter optimization (HPO), e.g., by increasing the batch size or by choosing an appropriate aggregation rule. The improvement is illustrated by a snippet of our experimental results in Figure 1. This finding is particularly interesting as these parameters have very little impact in most settings when considering DP or BR separately. We validate our theoretical insights in Section 4 through an exhaustive set of experiments on MNIST and Fashion-MNIST using neural networks.

Closely related prior works

There has been a long line of research on the interplay between DP and other notions of robustness in ML [12, 24, 31, 30, 34, 23, 27]. However, previous approaches do not apply to our setting for two main reasons: (1) they do not address the privacy of the dataset against an honest-but-curious server, and (2) their underlying notions of robustness are either weaker than or orthogonal to BR. Furthermore, recent works on the combination of privacy and BR in distributed learning either study a weaker privacy model than DP or provide only elementary analyses [9, 19, 29, 18]. We refer the interested reader to Appendix A for an in-depth discussion of prior works. In short, we believe the present paper to be the first to provide an in-depth analysis, with practical relevance, of the integration of DP and BR in distributed learning.

2 Problem Setting and Background

Let 𝒳 be the space of data points. We consider the parameter-server architecture with n workers owning a common dataset D = {x_1, …, x_m} of m points. The workers seek to collaboratively compute a parameter vector θ ∈ ℝ^d that minimizes the empirical loss function defined as follows:

L(θ) := (1/m) ∑_{i=1}^{m} ℓ(θ, x_i),    (1)

where ℓ is a point-wise loss function. We assume that L is differentiable and admits a non-trivial local minimum; in other words, L admits a critical point, but its gradient is not null everywhere. We also make the following standard assumptions.

Assumption 1 (Bounded norm).

There exists a finite real C such that for all θ ∈ ℝ^d and all x ∈ 𝒳, ‖∇ℓ(θ, x)‖ ≤ C.

Assumption 2 (Bounded variance).

There exists a real value σ₀ such that for all θ ∈ ℝ^d, E_x ‖∇ℓ(θ, x) − ∇L(θ)‖² ≤ σ₀², where x is sampled uniformly from D.

Assumption 3 (Smoothness).

There exists a real value β such that for all θ, θ′ ∈ ℝ^d, ‖∇L(θ) − ∇L(θ′)‖ ≤ β ‖θ − θ′‖.

Assumptions 2 and 3 are classical to most optimization problems in machine learning [6]. Assumption 1 is merely used to avoid unnecessary technicalities, especially when studying differential privacy. In practice, it can be easily enforced by gradient clipping [1].

In an ideal setting, when all the workers are honest (i.e., non-Byzantine) and data privacy is not an issue, a standard approach to solving the above learning problem is the distributed implementation of the stochastic gradient descent (SGD) method. In this algorithm, the server maintains an estimate of the parameter vector which is updated iteratively by using the average of the gradient estimates sent by the workers. However, this algorithm is vulnerable to both privacy and security threats.

Threat model.

We consider the server to be honest-but-curious, and that some of the workers are Byzantine. An honest-but-curious server follows the prescribed algorithm correctly, but may infer sensitive information about workers’ data using their gradients and any other additional information that can be gathered during the learning as demonstrated by [43]. On the other hand, Byzantine workers need not follow the prescribed algorithm correctly and can send arbitrary gradients. For instance, they may either crash or even send adversarial gradients to prevent convergence of the algorithm [5].

2.1 Distributed SGD with Differential Privacy

Over the last decade, differential privacy (DP) has become a gold standard in privacy-preserving data analysis [13]. Intuitively, a randomized algorithm is said to preserve DP if its executions on two adjacent datasets are indistinguishable. More formally, two datasets D and D′ are said to be adjacent if they differ by at most one sample. Then, (ε,δ)-DP is defined as follows.

Definition 1 ((ε,δ)-DP).

Let ε > 0, δ ∈ [0,1), and 𝒪 be an arbitrary output space. A randomized algorithm A is (ε,δ)-differentially private if for any two adjacent datasets D, D′, and any possible set of outputs O ⊆ 𝒪, P[A(D) ∈ O] ≤ e^ε · P[A(D′) ∈ O] + δ.

By far, the most widely used approach to ensure DP in machine learning is to use the differentially private version of SGD, called DP-SGD [32, 4, 1]. The distributed implementation of this scheme against an honest-but-curious server consists, at every step, in making the honest workers add Gaussian noise with variance σ² to their stochastic gradients before sending them to the server. When σ is chosen appropriately (e.g., see Theorem 1), each learning step satisfies (ε,δ)-DP at the worker level. Finally, the privacy guarantee of the overall learning procedure is obtained by using the composition property of DP [20, 1, 37]. However, we are mainly interested in studying the impact of the per-step and per-worker privacy budget on the resilience of the algorithm to Byzantine workers.
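As an illustration, a worker-side update of this kind can be sketched as follows; the clipping threshold, noise scale, and function names are our own, and the calibration of the noise scale to a target privacy budget is the subject of Theorem 1.

```python
import numpy as np

def dp_noisy_gradient(per_sample_grads, clip_norm, sigma, rng):
    """Worker-side DP-SGD step (sketch): clip every per-sample gradient to
    L2 norm <= clip_norm, average them, and add isotropic Gaussian noise of
    standard deviation sigma to every coordinate before sending the result."""
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_sample_grads]
    mean_grad = np.mean(clipped, axis=0)
    return mean_grad + rng.normal(0.0, sigma, size=mean_grad.shape)

# Example: 32 per-sample gradients of dimension 10, noise scale 0.5.
rng = np.random.default_rng(0)
noisy = dp_noisy_gradient(rng.normal(size=(32, 10)), clip_norm=1.0, sigma=0.5, rng=rng)
```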

2.2 Byzantine Resilience of Distributed SGD

In the presence of Byzantine workers, the server can no longer rely on the average of workers’ gradients to update the model parameters. Instead, it uses a gradient aggregation rule (GAR) that is resilient to incorrect gradients that may be sent by at most f Byzantine workers. A standard notion for defining this resilience is (α,f)-Byzantine resilience, stated below, which was originally proposed by [5].

Definition 2 ((α,f)-Byzantine resilience).

Let 0 ≤ α < π/2 and 0 ≤ f ≤ n. Consider n random vectors V_1, …, V_n, among which at least n − f are i.i.d. from a common distribution. Let G be a random vector characterizing this distribution, with g = E[G]. A GAR F is said to be (α,f)-Byzantine resilient for this distribution if its output R = F(V_1, …, V_n) satisfies the following two properties:

  1. ⟨E[R], g⟩ ≥ (1 − sin α) ‖g‖² > 0, and

  2. for any r ∈ {2, 3, 4}, E‖R‖^r is upper bounded by a linear combination of terms E‖G‖^{r_1} ⋯ E‖G‖^{r_k} where r_1 + ⋯ + r_k = r.

This condition has been shown to be critical for ensuring convergence of the distributed SGD algorithm in the presence of up to f Byzantine workers [5, 14]. Thus, it serves as an excellent starting point for studying the Byzantine resilience of distributed DP-SGD. Consequently, we consider the algorithm where the server implements a Byzantine-robust GAR while the honest workers follow the instructions prescribed in DP-SGD.

3 Combining Differential Privacy and Byzantine Resilience

Algorithm 1, described below, combines the standard techniques for DP and BR in distributed SGD. Given a GAR F and a noise injection parameter σ, Algorithm 1 computes T steps of distributed DP-SGD, using F as the aggregation rule at the server to guarantee BR.

Setup: The server chooses an arbitrary initial parameter vector θ_1, learning rates {γ_t}, and a deterministic GAR F. Honest workers have a fixed batch size b and noise injection parameter σ.
for t = 1, …, T do
        The server broadcasts θ_t to all workers.
        foreach honest worker w do
               1. w builds a set S_t^w by sampling b points at random without replacement from D and computes a noisy gradient estimate with noise injection parameter σ, i.e., it computes
g_t^w = (1/b) ∑_{x ∈ S_t^w} ∇ℓ(θ_t, x) + N(0, σ² I_d).    (2)
2. w sends the resulting noisy gradient g_t^w to the server.
        end foreach
       foreach Byzantine worker w do
               w sends to the server a (possibly arbitrary) vector g_t^w as its "gradient".
        end foreach
       The server computes the aggregate of the received gradients using F, i.e., it computes
G_t = F(g_t^1, …, g_t^n).    (3)
The server updates the parameter vector using the learning rate γ_t as follows
θ_{t+1} = θ_t − γ_t G_t.    (4)
end for
Algorithm 1 Distributed DP-SGD with Byzantine resilience

Note that when σ = 0, i.e., when no noise is injected, Algorithm 1 reduces to a classical Byzantine-resilient distributed SGD algorithm as presented in prior works such as Blanchard et al. [5] and El Mhamdi et al. [14]. Furthermore, when f = 0 and F is the averaging function, it reduces to a distributed implementation of the well-known DP-SGD scheme, e.g., from [1].
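The following self-contained sketch simulates Algorithm 1 on a toy least-squares problem, with the coordinate-wise median as the GAR and a naive sign-flipping Byzantine behaviour; all sizes, constants, and the attack are illustrative choices of ours, not the setup of Section 4.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative sizes and constants (not the paper's): n workers, f Byzantine,
# batch size b, noise scale sigma, on a toy least-squares problem.
d, m, n, f, b, lr, sigma, T = 10, 1_000, 7, 2, 64, 0.05, 0.5, 300
X, y = rng.normal(size=(m, d)), rng.normal(size=m)

def grad(theta, idx):
    A, t = X[idx], y[idx]
    return A.T @ (A @ theta - t) / len(idx)

def median_gar(vectors):
    # Coordinate-wise median, one of the GARs studied in the paper (Appendix B.3).
    return np.median(np.stack(vectors), axis=0)

theta = np.zeros(d)
for _ in range(T):
    honest = [grad(theta, rng.choice(m, b, replace=False))
              + rng.normal(0.0, sigma, size=d) for _ in range(n - f)]
    # Naive illustrative Byzantine behaviour (not the attacks used in Section 4):
    # send a large vector opposite to the true descent direction.
    byzantine = [-10.0 * grad(theta, rng.choice(m, b, replace=False)) for _ in range(f)]
    theta -= lr * median_gar(honest + byzantine)
```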

3.1 Differential Privacy Guarantee

Intuitively, Algorithm 1 should inherit the privacy guarantees of DP-SGD. Indeed, the privacy-preserving scheme applied at the worker level is the same and will not be altered by the GAR, thanks to the post-processing property of DP [13]. Then, owing to previous works, we can easily show that Algorithm 1 satisfies (ε,δ)-DP at each step and for each honest worker when σ is calibrated via the standard Gaussian mechanism (Lemma 1). Furthermore, as shown in Theorem 1, we can obtain a much tighter analysis using advanced analytical tools such as privacy amplification via sub-sampling [2].

Theorem 1.

Suppose that Assumption 1 holds true. Let ε and δ be the target privacy parameters. Consider Algorithm 1 with the noise parameter σ chosen accordingly. Then, each honest worker satisfies (ε,δ)-DP at every step of the procedure.

Henceforth, whenever we refer to Algorithm 1 with per-step and per-worker privacy budget (ε,δ), we will consider σ as defined in Theorem 1 above.
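Since the exact calibration of Theorem 1 is not reproduced here, the sketch below instead shows the looser, classical Gaussian-mechanism calibration for a mean of clipped gradients; the function and constants are illustrative, and the sensitivity 2C/b follows from Assumption 1 with a clipping threshold C (our notation).

```python
import math

def gaussian_sigma(epsilon, delta, clip_norm, batch_size):
    """Classical Gaussian-mechanism calibration for the mean of batch_size
    gradients, each clipped to L2 norm <= clip_norm (so the mean has L2
    sensitivity 2*clip_norm/batch_size). Valid for epsilon in (0, 1); it does
    NOT include the sub-sampling amplification used in Theorem 1, and hence
    over-estimates the noise needed for a given per-step (epsilon, delta)."""
    sensitivity = 2.0 * clip_norm / batch_size
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

print(gaussian_sigma(epsilon=0.5, delta=1e-5, clip_norm=1.0, batch_size=64))
```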

3.2 Inapplicability of Existing Results from the BR Literature

As discussed in Section 2, prior works on BR can demonstrate the convergence of Algorithm 1 if the GAR is (α,f)-Byzantine resilient during the entire learning process. However, verifying the validity of (α,f)-BR is nearly impossible, as the condition depends upon the gradients of the Byzantine workers, which can be arbitrary [5]. The only verifiable condition known in the literature to guarantee (α,f)-BR is the variance-to-norm (VN) condition, defined as follows [14].

Definition 3 (VN Condition).

For a parameter vector θ, let G_θ denote the random vector characterizing the gradients sent by the honest workers to the server at θ. A GAR satisfies the VN condition if for any θ such that G_θ has a non-zero mean,

where C_{n,f} is the multiplicative constant of the GAR, which depends on n and f. (Precise values of C_{n,f} for the most popular GARs can be found in Appendix B.)

This condition means that, for a GAR to guarantee convergence of the procedure, the distribution of the gradient estimates at parameter θ must be "well-behaved". For instance, if the norm of the expected stochastic gradients converges to zero, then so should their variance. Note that, in the case of Algorithm 1, from (2) we obtain that for any θ,

G_θ = (1/b) ∑_{x ∈ S} ∇ℓ(θ, x) + N(0, σ² I_d),    (5)

where S is a set of b data points sampled randomly without replacement from D. Thus, the VN condition can no longer be satisfied whenever σ > 0, i.e., whenever the workers follow the instructions prescribed in DP-SGD. We show this formally in Proposition 1 below.

Proposition 1.

Let σ > 0. Consider Algorithm 1 with noise parameter σ. If Assumption 3 holds true, then there exists no GAR that satisfies the VN condition.

Note that when ε and δ are non-zero, we will have σ > 0, as explained in Section 3.1. Accordingly, Proposition 1 means that prior results on the convergence of existing Byzantine resilient GARs, including the works by Blanchard et al. [5] and El Mhamdi et al. [14], are no longer valid when enforcing any non-zero level of DP. Although the VN condition is only a sufficient one, due to the lack of necessary conditions in the literature, it is the most widely used tool for proving BR, e.g., see Blanchard et al. [5], El Mhamdi et al. [14], Xie et al. [40], El-Mhamdi et al. [15], Boussetta et al. [8]. Hence, Proposition 1 highlights an inherent limitation of the theory of BR, especially when simultaneously enforcing DP via noise injection.
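The following small numerical illustration (with made-up values of the noise scale and model dimension) conveys the intuition behind Proposition 1: the DP noise keeps the variance of the honest gradients above σ²d, so a VN-style ratio of variance to squared gradient norm blows up near a critical point.

```python
import numpy as np

# Illustrative numbers: per-coordinate noise scale and model dimension.
sigma, d = 0.5, 100_000
noise_floor = sigma**2 * d  # variance contributed by the DP noise alone

for grad_norm in [10.0, 1.0, 0.1, 0.01]:
    ratio = noise_floor / grad_norm**2
    print(f"||E[G]|| = {grad_norm:6.2f}  ->  Var(G)/||E[G]||^2 >= {ratio:.1e}")
# The ratio grows without bound as the expected gradient shrinks near a
# critical point, so no fixed multiplicative VN-style bound can hold.
```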

3.3 Adapting the Theory of BR to Account for DP

To circumvent the aforementioned limitation, we relax the theory of (α,f)-BR by replacing the original VN condition with the -approximated VN condition defined below.

Definition 4 (-approximated VN condition).

Let denote the random vector characterizing the gradients sent by the honest workers to the server at parameter vector . For , a GAR satisfies the -approximated VN condition if for all such that ,

where C_{n,f} is the multiplicative constant of the GAR, which depends on n and f.

Definition 4 relaxes the initial VN condition by allowing a subset of (possible) parameter vectors to violate the inequality in Definition 3. In particular, as , when the gradients are sufficiently close to a local minimum, or , the inequality need not be satisfied. While the -approximated VN condition is a natural extension of Definition 3, it enables us to study cases where the distribution of the gradients at is non-trivial, e.g., the variance of need not vanish when approaches . Consequently, we can utilize this new criterion to analyze the convergence of Algorithm 1 for different GARs and levels of privacy. Assuming -approximated VN condition, we show in Theorem 2 the approximate convergence of Algorithm 1.

Theorem 2.

Let and . Consider Algorithm 1 with , a GAR satisfying the -approximated VN condition, and for all . If Assumptions 1, 2, and 3 hold true, then there exists and such that for any ,

where is the minimum value of , i.e., , and .

According to Theorem 2, Algorithm 1 can compute a parameter for which in expectation with a rate of . In other words, when the loss function is regularized (see, e.g., Bottou et al. [6]), it finds an approximate local minimum with an error proportional to . Note that, when (i.e., when DP is not enforced), the above result encapsulates the existing convergence results from the BR literature, e.g., Blanchard et al. [5], El Mhamdi et al. [14].

Remark 1.

For generality, we do not provide the exact values for parameters and in Theorem 2. These two constants depend on the learning scheme that is applied, in particular the resilience properties of the GAR used. However, since these parameters are constant throughout the learning procedure, keeping them to be generic does not affect our conclusions on the asymptotic error.

3.4 Studying the Interplay between DP and BR

The value of the approximation parameter is intrinsically linked to the amount of noise that the workers inject into the procedure. In a way, it represents the impact of per-worker DP on the resilience of Algorithm 1 to Byzantine workers. To quantify this impact, we present in Proposition 2 sufficient and necessary conditions for a GAR to satisfy the -approximated VN condition in the context of Algorithm 1.

Proposition 2.

Let , . Consider Algorithm 1 with privacy budget and GAR with multiplicative constant . Then, the following assertions hold true.

  1. Under Assumptions 1 and 3, the -approximated VN condition can hold true only if

  2. Additionally, under Assumption 2, if

    then the GAR satisfies the -approximated VN condition.

The above result, in conjunction with Theorem 2, presents a convergence guarantee that can be obtained by distributed DP-SGD under Byzantine faults. In particular, we have the following corollary of Theorem 2 and Proposition 2.

Corollary 1.

Let , . Consider Algorithm 1 with privacy budget (ε,δ) and GAR F, and for all . If Assumptions 1, 2, and 3 are satisfied, then for any ,

Corollary 1 quantifies the impact of different parameters on the convergence of the algorithm. For instance, we observe that larger values of ε and δ, i.e., weaker DP guarantees, imply a smaller worst-case convergence error and, therefore, a better guarantee of learning. But importantly, it also shows how the convergence guarantee of the algorithm depends upon other hyperparameters, namely the batch size b, the number of parameters d, and the multiplicative constant C_{n,f} of the GAR. Let us, for example, take the case of the batch size below.

Impact of batch size. We consider the specific GAR of Minimum-Diameter Averaging (MDA) for which  [14]. Then, from Corollary 1, we obtain that

From the above, we note that when both DP and BR are enforced (i.e., the noise level is non-zero and there is at least one Byzantine worker), increasing the batch size indeed reduces the asymptotic convergence error of the algorithm. However, this is not the case when we consider DP and BR separately. When all workers are honest, the corresponding error term vanishes, and the algorithm asymptotically converges to a local minimum regardless of the batch size used. On the other hand, when the workers do not obfuscate their gradients (σ = 0), the -approximated VN condition holds true and the asymptotic convergence error of the algorithm is again independent of the batch size. To conclude, the batch size plays a crucial role in improving the learning accuracy when enforcing DP and BR simultaneously, but it should have little influence when considering them individually.
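To make the trend concrete, the sketch below computes a simple proxy for the per-worker DP noise floor as a function of the batch size, reusing the plain Gaussian-mechanism calibration from the earlier sketch; the dimension, privacy levels, and clipping threshold are illustrative choices of ours, not the exact bound of Corollary 1.

```python
import math

def dp_noise_floor(batch_size, d, epsilon, delta, clip_norm=1.0):
    """Proxy for the per-worker DP noise variance sigma^2 * d, with sigma
    calibrated by the plain Gaussian mechanism (sensitivity 2*clip_norm/b).
    Larger batches shrink the sensitivity, hence the noise and the floor."""
    sigma = (2.0 * clip_norm / batch_size) \
        * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    return sigma**2 * d

for b in [25, 150, 500, 1500]:
    floor = dp_noise_floor(b, d=100_000, epsilon=0.5, delta=1e-5)
    print(f"b = {b:5d}  ->  DP noise floor ~ {floor:.2e}")
# With sigma = 0 (no DP) or with no Byzantine workers (plain averaging of
# zero-mean noise), there is no such floor and the batch size matters far less.
```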

Remark 2.

Although Corollary 1 provides some useful insights on improving the accuracy of the learning algorithm combining DP and BR, it need not be tight, as it only provides an upper bound relying on a sufficient condition, namely the -approximated VN condition. It turns out that providing a non-trivial lower bound for distributed SGD in the presence of Byzantine faults remains an open problem, even without DP. In spite of this, we show the practical relevance of the insights obtained from Corollary 1 through an exhaustive set of experiments in the subsequent section.

4 Numerical Experiments

The goal of our experiments is to investigate whether our theoretical insights are actually applicable in practice and whether hyperparameter optimization (HPO) can improve the integration of DP and BR. Accordingly, we assess the impact of varying different hyperparameters on the training losses and top-1 cross-accuracies of a neural network under (ε,δ)-DP and attacks from Byzantine workers, over a fixed maximum number of learning steps.

4.1 Experimental Setup

Datasets.

We use MNIST [22] and Fashion-MNIST [38]. The datasets are pre-processed before training: MNIST images are normalized (with a fixed mean and standard deviation), and Fashion-MNIST is expanded with horizontally flipped images. Due to space limitations, we only showcase here results on the Fashion-MNIST dataset.

Architecture and fixed hyperparameters.

We consider a feed-forward neural network composed of two fully-connected linear layers of respectively 784 and 100 inputs, terminated by a softmax layer of 10 dimensions; ReLU is used between the two linear layers. We use the cross-entropy loss, a fixed total number of workers, Polyak’s momentum at the workers, a constant learning rate, and gradient clipping. We also add a regularization factor. Note that some of these constants are reused from the literature on BR, especially from Baruch et al. [3], Xie et al. [41], El-Mhamdi et al. [16].

Varying hyperparameters for HPO.

For both datasets, we vary the batch size within {25, 50, 150, 300, 500, 750, 1000, 1250, 1500}, the per-step and per-worker privacy parameter ε (δ is fixed), and the number of Byzantine workers, as well as the attack they implement (little from Baruch et al. [3] and empire from Xie et al. [41]). We also vary the Byzantine resilient GAR among the rules presented in Appendix B (Krum, MDA, Median, and Bulyan). Note that, due to its large computational cost, we only use the Bulyan aggregation rule in a subset of the configurations.

Each of the 432 possible combinations of these hyperparameters is run 5 times using seeds from 1 to 5 (for reproducibility purposes), totalling 2160 runs. Each run satisfies (ε,δ)-DP at every step under attacks from Byzantine workers. To assess the impact of the privacy noise alone, we also run the experiments specified above with the averaging GAR and without Byzantine workers (denoted by “No attack”). These experiments account for another 27 combinations, totalling 135 additional runs. Overall, we performed a comprehensive set of 2295 runs, for which we provide a brief summary below. More details on the experimental setup and results can be found in Appendices D and E.
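For bookkeeping, a grid of this kind can be enumerated as sketched below; the batch sizes are the ones listed above, whereas the privacy levels, Byzantine counts, and GAR labels are placeholders of ours, so the printed counts are only indicative, not a restatement of the exact experimental plan.

```python
import itertools

# Placeholder grids: only the batch sizes are taken from the text above.
batch_sizes = [25, 50, 150, 300, 500, 750, 1000, 1250, 1500]
epsilons    = ["eps_a", "eps_b", "eps_c"]      # hypothetical privacy levels
byzantine_f = ["f_small", "f_large"]           # hypothetical Byzantine counts
attacks     = ["little", "empire"]
gars        = ["krum", "mda", "median", "bulyan"]

grid = list(itertools.product(batch_sizes, epsilons, byzantine_f, attacks, gars))
seeds = range(1, 6)
print(len(grid), "combinations,", len(grid) * len(seeds), "runs before any filtering")
```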

4.2 Experimental Results

Figure 2: Maximum top-1 cross-accuracy reached on Fashion-MNIST when only varying the batch size for different threat scenarios and different GARs. The first and second rows show the little and empire attacks respectively. The first and second columns display and respectively. All reported metrics include a standard deviation obtained with the 5 consecutive runs. As a reference, note that the maximal top-1 cross-accuracy achieved by the tested model in the vanilla setting (i.e., with neither DP nor Byzantine faults) is around 84–85% for Fashion-MNIST.

In Figure 2, we give a snapshot of our results by showcasing four characteristic outcomes, which we further characterize below. Besides validating our theoretical insights on the impact of the batch size and the GAR selection on the convergence of Algorithm 1, these plots also showcase the threat scenarios in which hyperparameter optimization (HPO) has the most impact. Note that the little attack was more damaging than empire in our experiments; hence, in the discussion below, we consider little to be a stronger threat than empire, ceteris paribus.

  1. Strongest threat scenario (top left). We consider little with and , i.e., the strongest level of attack and privacy we implemented. In this stringent scenario, the algorithm fails to deliver good learning accuracy under Byzantine attacks. Although increasing the batch size helps improve the convergence, the accuracy remains quite poor (well below , even when ).

  2. Relaxed threat scenario (bottom left). Here, we keep and , but we trade the attack for a weaker one (empire). This scenario validates our intuition on the advantage of increasing the batch size, but it mostly highlights the impact of GAR selection. The different GARs differ significantly in their maximum cross-accuracies, with MDA performing the best.

  3. Mild threat scenario (top right). We now consider and , i.e., a weaker privacy guarantee and fewer Byzantine workers. However, we revert back to the little attack. We see that, for all GARs, increasing the batch size significantly improves the maximum cross-accuracy. The choice of GAR also impacts the performance, with Bulyan being the best.

  4. Weakest threat scenario (bottom right). We consider empire with and . The threat is so weak that all GARs perform almost the same. Although HPO still helps to obtain a better accuracy, it is not critical in this setting.

Main Takeaway.

Our empirical results show that training a feed-forward neural network under both DP and BR is possible but expensive in some settings. Indeed, in the non-trivial threat scenarios, achieving the same maximum cross-accuracy as Byzantine-free DP-SGD requires a substantially larger per-worker batch size than in the Byzantine-free setting. Moreover, depending upon the setting, the selection of the GAR might be more influential than the batch size. Finally, note that in the Byzantine-free setting, the DP-SGD algorithm obtains reasonable cross-accuracies for most batch sizes considered. This validates our theoretical finding (discussed in Section 3.4) that the batch size has a more significant impact when combining DP and BR than when enforcing DP alone. Similar observations on the negligible impact of the batch size in the privacy-free setting (but under Byzantine attacks) can be found in Appendix E.

5 Conclusion & Open problems

In this paper, we have studied the integration of standard approaches to DP and BR, namely the distributed implementation of the popular DP-SGD protocol in conjunction with (α,f)-BR GARs. Upon highlighting the limitations of the existing theory of BR when applied to this algorithm, we have proposed a generalization of this theory. By doing so, we have (1) quantified the impact of DP on BR, and (2) proposed an HPO scheme to effectively combine DP and BR. Our results have shown that DP and BR can be combined, but at the expense of computational cost in some settings.

Our generalization of the theory of (α,f)-BR is also of independent interest. Specifically, we have proposed a relaxation of the VN condition, namely the -approximated VN condition. Although the VN condition is quite stringent and only sufficient, it is consistently relied upon to design and study different Byzantine resilient GARs [5, 14, 15, 16, 8]. Hence, our convergence result, obtained using the relaxed -approximated VN condition, supersedes many existing results in the BR literature.

Interestingly, we have observed through our experiments that even when the relaxed -approximated VN condition is violated, the algorithm obtains reasonable learning accuracy. This observation opens two interesting problems expounded below.

  1. A theoretical problem: The VN condition (whether approximated or not) is not tight enough to fully characterize BR. That is, in some cases, a GAR may be (α,f)-BR without satisfying the VN condition. Furthermore, the theory of BR focuses on "worst-case" attacks that, for now, might not be achievable in practice. Hence, the question of the tightness of the VN condition for any specific attack, even without DP, remains open.

  2. An empirical problem: The practice of BR focuses on state-of-the-art realizable attacks. These attacks are arguably sub-optimal, which explains why we can obtain reasonable learning accuracy despite the violation of the VN condition. This also calls for designing better (or stronger) attacks.

Finally, while we have focused on adapting the theory of BR to make it more compatible with the standard DP-SGD algorithm, an alternate future direction could be to investigate other DP mechanisms that may comply better with classical approaches to BR, while preserving DP guarantees.

References

  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. pp. 308–318. Cited by: Appendix A, §D.2, §1, §1, §2.1, §2, §3.
  • [2] B. Balle, G. Barthe, and M. Gaboardi (2018) Privacy amplification by subsampling: tight analyses via couplings and divergences. Red Hook, NY, USA, pp. 6280–6290. Cited by: §C.1, §3.1, Lemma 2.
  • [3] M. Baruch, G. Baruch, and Y. Goldberg (2019) A little is enough: circumventing defenses for distributed learning. Cited by: §D.3, Figure 1, §4.1, §4.1.
  • [4] R. Bassily, A. Smith, and A. Thakurta (2014) Private empirical risk minimization: efficient algorithms and tight error bounds. pp. 464–473. External Links: Document Cited by: §2.1.
  • [5] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer (2017) Machine learning with adversaries: byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 119–129. External Links: Link Cited by: Appendix A, §B.1, §C.4, §1, §2, §2.2, §2.2, §3.2, §3.2, §3.3, §3, §5.
  • [6] L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. Siam Review 60 (2), pp. 223–311. Cited by: §2, §3.3.
  • [7] L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, Y. Lechevallier and G. Saporta (Eds.), Heidelberg, pp. 177–186. External Links: ISBN 978-3-7908-2604-3 Cited by: §1.
  • [8] A. Boussetta, E. El-Mhamdi, R. Guerraoui, A. Maurer, and S. Rouault (2021) AKSEL: fast byzantine sgd. Cited by: Appendix A, §3.2, §5.
  • [9] X. Chen, J. Ji, C. Luo, W. Liao, and P. Li (2018) When machine learning meets blockchain: a decentralized, privacy-preserving and secure design. pp. 1178–1187. Cited by: Appendix A, §1.
  • [10] G. Damaskinos, C. Mendler-Dünner, R. Guerraoui, N. Papandreou, and T. Parnell (2021-05) Differentially private stochastic coordinate descent. Proceedings of the AAAI Conference on Artificial Intelligence 35 (8), pp. 7176–7184. External Links: Link Cited by: Appendix A.
  • [11] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng (2012) Large scale distributed deep networks. pp. . External Links: Link Cited by: §1.
  • [12] C. Dwork and J. Lei (2009) Differential privacy and robust statistics. New York, NY, USA, pp. 371–380. External Links: ISBN 9781605585062, Link, Document Cited by: Appendix A, §1.
  • [13] C. Dwork, A. Roth, et al. (2014) The algorithmic foundations of differential privacy.. Foundations and Trends in Theoretical Computer Science 9 (3-4), pp. 211–407. Cited by: §1, §2.1, §3.1, Lemma 1.
  • [14] E. M. El Mhamdi, R. Guerraoui, and S. Rouault (2018-10–15 Jul) The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3521–3530. External Links: Link Cited by: Appendix A, §B.2, §B.4, §C.4, §1, §2.2, §3.2, §3.2, §3.3, §3.4, §3, §5.
  • [15] E. El-Mhamdi, R. Guerraoui, A. Guirguis, L. N. Hoang, and S. Rouault (2020) Genuinely distributed Byzantine machine learning. In Proceedings of the 39th Symposium on Principles of Distributed Computing (PODC ’20), New York, NY, USA. External Links: ISBN 9781450375825, Link, Document Cited by: Appendix A, §B.2, §3.2, §5.
  • [16] E. El-Mhamdi, R. Guerraoui, and S. Rouault (2021) Distributed momentum for byzantine-resilient stochastic gradient descent. External Links: Link Cited by: item 1, §4.1, §5.
  • [17] S. Gade and N. H. Vaidya (2018) Privacy-preserving distributed learning via obfuscated stochastic gradients. pp. 184–191. External Links: Document Cited by: Appendix A.
  • [18] R. Guerraoui, N. Gupta, R. Pinot, S. Rouault, and J. Stephan (2021) Differential privacy and byzantine resilience in sgd: do they add up?. New York, NY, USA, pp. 391–401. External Links: ISBN 9781450385480, Link, Document Cited by: Appendix A, §1.
  • [19] L. He, S. P. Karimireddy, and M. Jaggi (2020) Secure byzantine-robust machine learning. External Links: 2006.04747 Cited by: Appendix A, §1.
  • [20] P. Kairouz, S. Oh, and P. Viswanath (2015-07–09 Jul) The composition theorem for differential privacy. Lille, France, pp. 1376–1385. External Links: Link Cited by: §D.2, §2.1.
  • [21] L. Lamport, R. Shostak, and M. Pease (1982-07) The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4 (3), pp. 382–401. External Links: ISSN 0164-0925, Link, Document Cited by: §1.
  • [22] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Link Cited by: §4.1.
  • [23] M. Lécuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. pp. 656–672. External Links: Document Cited by: Appendix A, §1.
  • [24] Y. Ma, X. Zhu, and J. Hsu (2019-07) Data poisoning against differentially-private learners: attacks and defenses. pp. 4732–4738. External Links: Document, Link Cited by: Appendix A, §1.
  • [25] M. Naseri, J. Hayes, and E. D. Cristofaro (2020) Toward robustness and privacy in federated learning: experimenting with local and central differential privacy. ArXiv abs/2009.03561. Cited by: Appendix A.
  • [26] (2021) Opacus PyTorch library. Note: Available from opacus.ai Cited by: §D.2.
  • [27] R. Pinot, F. Yger, C. Gouy-Pailler, and J. Atif (2019) A unified view on differential privacy and robustness to adversarial examples. arXiv preprint arXiv:1906.07982. Cited by: Appendix A, §1.
  • [28] R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. pp. 909–910. External Links: Document Cited by: Appendix A.
  • [29] J. So, B. Guler, and A. S. Avestimehr (2020) Byzantine-resilient secure federated learning. External Links: 2007.11115 Cited by: Appendix A, §1.
  • [30] L. Song, R. Shokri, and P. Mittal (2019) Membership inference attacks against adversarially robust deep learning models. pp. 50–56. External Links: Document Cited by: Appendix A, §1.
  • [31] L. Song, R. Shokri, and P. Mittal (2019) Privacy risks of securing machine learning models against adversarial examples. New York, NY, USA, pp. 241–257. External Links: ISBN 9781450367479, Document Cited by: Appendix A, §1.
  • [32] S. Song, K. Chaudhuri, and A. D. Sarwate (2013) Stochastic gradient descent with differentially private updates. pp. 245–248. External Links: Document Cited by: Appendix A, §2.1.
  • [33] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Training very deep networks. pp. . External Links: Link Cited by: §1.
  • [34] Z. Sun, P. Kairouz, A. T. Suresh, and H. B. McMahan (2019) Can you really backdoor federated learning?. CoRR abs/1911.07963. External Links: Link, 1911.07963 Cited by: Appendix A, §1.
  • [35] F. Tang, W. Wu, J. Liu, and M. Xian (2019-04) Privacy-preserving distributed deep learning via homomorphic re-encryption. Electronics 8, pp. 411. External Links: Document Cited by: Appendix A.
  • [36] H. Wang, K. Sreenivasan, S. Rajput, H. Vishwakarma, S. Agarwal, J. Sohn, K. Lee, and D. S. Papailiopoulos (2020) Attack of the tails: yes, you really can backdoor federated learning. External Links: Link Cited by: Appendix A.
  • [37] Y. Wang, B. Balle, and S. P. Kasiviswanathan (2019) Subsampled Rényi differential privacy and analytical moments accountant. pp. 1226–1235. External Links: Link Cited by: §2.1.
  • [38] H. Xiao, K. Rasul, and R. Vollgraf (2017-08-28) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.1.
  • [39] C. Xie, O. Koyejo, and I. Gupta (2018) Generalized byzantine-tolerant sgd. External Links: 1802.10116 Cited by: Appendix A.
  • [40] C. Xie, O. Koyejo, and I. Gupta (2018) Phocas: dimensional byzantine-resilient stochastic gradient descent. External Links: 1805.09682 Cited by: Appendix A, §3.2.
  • [41] C. Xie, O. Koyejo, and I. Gupta (2019) Fall of empires: breaking byzantine-tolerant SGD by inner product manipulation. pp. 83. Cited by: §D.3, §4.1, §4.1.
  • [42] D. Yin, Y. Chen, R. Kannan, and P. Bartlett (2018-10–15 Jul) Byzantine-robust distributed learning: towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 5650–5659. External Links: Link Cited by: Appendix A, §B.3.
  • [43] L. Zhu, Z. Liu, and S. Han (2019) Deep leakage from gradients. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 14774–14784. External Links: Link Cited by: §1, §2.

Appendix A Related Work

Privacy.

In the past, significant attention has been given to protecting data privacy for both centralized [32, 10, 1] and distributed SGD [28, 25]. Although several techniques for data protection exist such as the encryption [35] or obfuscation [17] of gradients, the most standard approach consists in adding DP noise to the gradients computed by the workers [1, 32, 28, 25], which is what we consider. However, these works only consider a fault-free setting where all workers are assumed to be honest.

Byzantine resilience.

In a separate line of research, several other works have designed Byzantine resilient schemes for distributed SGD in the parameter-server architecture [5, 14, 42, 39, 40, 15, 8]. Nevertheless, in these papers, the training data is not protected, meaning that their methods do not consider the privacy threat associated with sharing unencrypted gradients with the server.

Combining privacy and BR.

Although scarce, there has been some work on tackling the problem of combining privacy and BR. For instance, He et al. [19] consider this problem for a different framework that includes two honest-but-curious non-colluding servers, a strong assumption that does not always hold in practice. Furthermore, their additive secret sharing scheme is rendered ineffective in our setting where there is a single honest-but-curious server that obtains information from all the workers. In the context of privacy, the single-server setting generalizes the multi-server setting with colluding servers. Another related work, the BREA framework, proposes the use of verifiable secret sharing amongst workers [29]. However, the presented privacy scheme scales more poorly than DP mechanisms, and is infeasible in most distributed ML settings with no inter-worker communication. Chen et al. [9] propose the LearningChain framework that is claimed to combine DP and BR. However, LearningChain is an experimental method, and Chen et al. do not provide any formal guarantees either on the resilience or on the convergence of the proposed algorithm.

Recently, Guerraoui et al. [18] studied the problem of satisfying both DP and BR in a single-server distributed SGD framework. While they demonstrate the computational hardness of this problem in practice, we go beyond by showing an inherent incompatibility between the supporting theory of -BR and the Gaussian mechanism from DP. Moreover, our approximate convergence result generalizes the prior works on BR. This generalization is critical to quantifying the interplay between DP and BR. Importantly, while Guerraoui et al. [18] only give elementary analysis explaining the difficulty of the problem, we show that a careful analysis can help combine DP and BR.

Studying the interplay between DP and other notions of robustness.

There has been a long line of work studying the interplay and mutual benefits of DP and robustness to data corruption in the centralized learning setting [12, 24]. However, these works do not consider the problem of a distributed scenario with an honest-but-curious server, and they are not applicable to our setting. Furthermore, data corruption is actually a weaker threat than BR as the adversary cannot select its gradients online to disrupt the learning process.

Recently, there has been some work on the interplay between DP and robustness to evasion attacks (a.k.a. adversarial examples). Interestingly, some findings in that line of research are similar to ours. DP and robustness to adversarial examples have been shown to be very close from a high-level theoretical point of view, even if their semantics are very different [23, 27]. However, some recent works have pointed out that these two notions might be conflicting in some settings [30, 31]. It is, however, worth noting that BR and robustness to adversarial examples are two orthogonal concepts. In particular, the robustness of a model (at testing time) to evasion attacks does not provide any guarantee on its robustness (at training time) to Byzantine behaviors. Similarly, as BR focuses on the training (optimization) procedure, we can always train models using a Byzantine resilient aggregation rule without obtaining robustness to evasion attacks. The connection between these two notions of robustness remains an open problem.

Finally, Sun et al. [34] suggested that in the context of federated learning, differential privacy could help defend against backdoor attacks. However, this hypothesis got challenged by Wang et al. [36].

Appendix B Standard GARs With Associated Multiplicative Constants

In this section, we present the different GARs used in our experiments, along with their associated VN conditions (Definition 3) and multiplicative constants .

B.1 Krum

Krum is an aggregation rule introduced under the assumption that . It consists in selecting the gradient which has the smallest mean squared distance, where the mean is computed over its closest gradients [5]. Formally, let be the gradients received by the parameter server. For any and , we denote by the fact that is amongst the closest vectors (in distance) to within the submitted gradients. Krum assigns to each a score

(6)

and outputs the gradient with the lowest score. Blanchard et al. [5] prove that Krum is (α,f)-Byzantine resilient, assuming that the following VN condition is satisfied:

(7)

Therefore, the multiplicative constant for Krum is

(8)
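A minimal sketch of Krum, assuming the usual choice of n − f − 2 neighbours per score (the precise constants elided above are not reproduced here):

```python
import numpy as np

def krum(gradients, f):
    """Krum (sketch): score each gradient by the sum of squared distances to
    its n - f - 2 closest peers and output the lowest-scoring gradient."""
    V = np.stack(gradients)
    n = len(V)
    sq_dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    k = n - f - 2                       # neighbours entering each score
    scores = [np.sort(np.delete(sq_dists[i], i))[:k].sum() for i in range(n)]
    return V[int(np.argmin(scores))]
```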

B.2 Minimum-Diameter Averaging (MDA)

MDA is an aggregation rule introduced under the assumption that . It outputs the average of the most clumped gradients among the received ones [14, 15]. Formally, let be the set of gradients received by the parameter server and let be the set of all subsets of of cardinality . MDA chooses the set

(9)

and outputs the average of the vectors in the selected set. El Mhamdi et al. [14] prove that MDA is (α,f)-Byzantine resilient, assuming that the following VN condition holds true:

Therefore, the multiplicative constant for MDA is

(10)
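A minimal sketch of MDA; the exhaustive subset search makes it practical only for a small number of workers, and the helper is our own illustration rather than the paper's implementation.

```python
import itertools
import numpy as np

def mda(gradients, f):
    """Minimum-Diameter Averaging (sketch): among all subsets of n - f
    received gradients, average the subset with the smallest diameter
    (largest pairwise distance)."""
    V = np.stack(gradients)
    n = len(V)
    best, best_diam = None, float("inf")
    for subset in itertools.combinations(range(n), n - f):
        pts = V[list(subset)]
        diam = max((np.linalg.norm(pts[i] - pts[j])
                    for i in range(len(pts)) for j in range(i + 1, len(pts))),
                   default=0.0)
        if diam < best_diam:
            best, best_diam = subset, diam
    return V[list(best)].mean(axis=0)
```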

B.3 Median

Yin et al. [42] introduce the Median aggregation rule under the assumption that . When using Median, the parameter server outputs the coordinate-wise median of the submitted gradients. We recall that every submitted gradient lies in ℝ^d, where d is the number of parameters of the model. Formally, Median is defined as follows

(11)

where is the th coordinate of , and median is the real-valued median. In other words,

where . The VN condition for Median is the following:

Therefore, the multiplicative constant for Median is

(12)
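A minimal sketch of the coordinate-wise median, together with a toy usage (our own numbers) showing that a single extreme Byzantine vector barely moves the output:

```python
import numpy as np

def coordinate_wise_median(gradients):
    """Coordinate-wise median: for every coordinate k, output the real-valued
    median of the k-th coordinates of the submitted gradients."""
    return np.median(np.stack(gradients), axis=0)

# Three honest gradients near (1, 1) and one extreme Byzantine vector: the
# coordinate-wise median stays close to the honest values.
grads = [np.array([1.0, 1.1]), np.array([0.9, 1.0]),
         np.array([1.1, 0.9]), np.array([1e6, -1e6])]
print(coordinate_wise_median(grads))   # ~[1.05, 0.95]
```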

B.4 Bulyan

Bulyan is an aggregation rule defined under the assumption that . It is actually not an aggregation rule in the conventional sense, but rather an iterative method that repeatedly uses an existing GAR [14]. In this paper, we use Bulyan on top of Krum, defined above. Formally, Bulyan uses Krum iteratively, each time discarding the highest-scoring gradient. After that, the parameter server is left with a set of the "lowest-scoring" gradients selected by Krum, as mentioned in Appendix B.1. Bulyan then outputs the average of the closest gradients to the coordinate-wise median of the selected gradients.

The VN condition for Bulyan is the same as that of Krum (i.e., equation 7). Therefore, the multiplicative constant for Bulyan is

(13)
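The sketch below follows one standard formulation of Bulyan-on-Krum, in which Krum's lowest-scoring pick is iteratively moved into a selected set of n − 2f gradients and, per coordinate, the n − 4f values closest to the median are averaged; the exact counts elided above may differ, so treat these constants as assumptions.

```python
import numpy as np

def krum_index(V, f):
    """Index of the gradient with the lowest Krum score (see B.1)."""
    n = len(V)
    sq_dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    k = max(n - f - 2, 1)
    scores = [np.sort(np.delete(sq_dists[i], i))[:k].sum() for i in range(n)]
    return int(np.argmin(scores))

def bulyan(gradients, f):
    """Bulyan on top of Krum (sketch): iteratively move Krum's selection into
    a set of n - 2f gradients, then average, per coordinate, the n - 4f
    selected values closest to the coordinate-wise median."""
    pool = [np.asarray(g, dtype=float) for g in gradients]
    n = len(pool)
    selected = []
    for _ in range(n - 2 * f):
        selected.append(pool.pop(krum_index(np.stack(pool), f)))
    S = np.stack(selected)                      # shape (n - 2f, d)
    med = np.median(S, axis=0)
    beta = len(S) - 2 * f                       # values averaged per coordinate
    closest = np.argsort(np.abs(S - med), axis=0)[:beta]
    return np.take_along_axis(S, closest, axis=0).mean(axis=0)
```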

Appendix C Proofs omitted from the main paper

C.1 Technical background on privacy

Before demonstrating Theorem 1, we recall some classical tools from the DP literature: the definition of sensitivity, the privacy guarantee of Gaussian noise injection, and the notion of privacy amplification by sub-sampling.

Definition 5 (Sensitivity).

Let h : 𝒳^b → ℝ^d. The sensitivity of h, denoted by Δ(h), is the maximum norm of the difference between the outcomes of h when applied on any two adjacent datasets, i.e.,

Δ(h) = max_{D ≃ D′} ‖h(D) − h(D′)‖,

where ≃ denotes the adjacency between the databases D and D′ from 𝒳^b.

Using this notion of sensitivity, we can demonstrate that the Gaussian noise injection scheme (a.k.a. the Gaussian mechanism) satisfies (ε,δ)-DP for a well-chosen noise injection parameter σ.

Lemma 1 ([13]).

Let ε ∈ (0,1), δ ∈ (0,1), and h : 𝒳^b → ℝ^d. The scheme that takes D ∈ 𝒳^b as input and outputs

h(D) + N(0, σ² I_d)

satisfies (ε,δ)-DP if σ ≥ Δ(h) √(2 ln(1.25/δ)) / ε.

Finally, let us introduce the concept of privacy amplification by sub-sampling. Here, we study sub-sampling without replacement defined as follows.

Definition 6.

(Sub-sampling) Given a dataset D of m points and a constant b ≤ m, the procedure subsample_b selects b points at random and without replacement from D.

This sub-sampling procedure has been widely studied in the privacy preserving literature and is known to provide privacy amplification. In particular, Balle et al. [2] demonstrated that it satisfies the following privacy amplification lemma.

Lemma 2 (Balle et al. [2]).

Let ε > 0, δ ∈ (0,1), b ≤ m, and 𝒪 be an arbitrary output space. Let A : 𝒳^b → 𝒪 be an (ε,δ)-DP algorithm and define A′ : 𝒳^m → 𝒪 as A′(D) = A(subsample_b(D)). Then A′ is (ε′,δ′)-DP, with ε′ = log(1 + (b/m)(e^ε − 1)) and δ′ = (b/m)δ.
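For illustration, this amplification bound (in the commonly cited form for sampling without replacement) can be evaluated as follows; the numbers are only examples.

```python
import math

def amplified_privacy(epsilon, delta, batch_size, dataset_size):
    """Privacy amplification by sub-sampling without replacement: with
    sampling rate gamma = b/m, an (eps, delta)-DP mechanism becomes
    (log(1 + gamma*(exp(eps) - 1)), gamma*delta)-DP."""
    gamma = batch_size / dataset_size
    return math.log(1.0 + gamma * (math.exp(epsilon) - 1.0)), gamma * delta

print(amplified_privacy(epsilon=2.0, delta=1e-5, batch_size=64, dataset_size=60_000))
```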

C.2 Proof of Theorem 1

Theorem 1.

Suppose that Assumption 1 holds true. Let ε and δ be the target privacy parameters. Consider Algorithm 1 with the noise parameter σ chosen accordingly. Then, each honest worker satisfies (ε,δ)-DP at every step of the procedure.

Proof.

Let t be an arbitrary step of Algorithm 1 and θ_t the parameter at step t. Let us consider an arbitrary honest worker w. Note that the batch on which w computes its gradient estimate is constituted of b points randomly sampled without replacement from D; hence we can write it as subsample_b(D). We now denote by h the function that evaluates the mean gradient at θ_t using a batch S. Specifically,

h(S) = (1/b) ∑_{x ∈ S} ∇ℓ(θ_t, x).    (14)

We denote by M the noise injection scheme, i.e., for any batch S,

M(S) = h(S) + N(0, σ² I_d).    (15)

Following the above notation, at step t, the honest worker w computes the noisy gradient estimate M(subsample_b(D)). Hence, it suffices to show that this scheme satisfies (ε,δ)-DP to conclude the proof.

Since two adjacent batches can only differ in one point, using Assumption 1, we have that ‖∇ℓ(θ_t, x)‖ ≤ C for every point, which implies Δ(h) ≤ 2C/b. Then, according to Lemma 1, M satisfies DP for an intermediate privacy budget determined by σ. Finally, it suffices to use Lemma 2 to conclude that M ∘ subsample_b is (ε,δ)-differentially private. ∎

C.3 Proof of Proposition 1

Proposition 1.

Let σ > 0. Consider Algorithm 1 with noise parameter σ. If Assumption 3 holds true, then there exists no GAR that satisfies the VN condition.

Proof.

Let us consider an arbitrary GAR with multiplicative constant C_{n,f}. We denote the set of critical points of the loss function by 𝒞. While considering Algorithm 1, the random vector that characterizes the gradients sent by the honest workers at a given parameter vector θ is defined as follows, for all θ,

G_θ = (1/b) ∑_{x ∈ S} ∇ℓ(θ, x) + N(0, σ² I_d),

where S is a set of b points randomly sampled without replacement from D (denoted subsample_b(D)). To show that the VN condition (in Definition 3) does not hold true, we show that there exists θ such that the corresponding inequality is violated.

For doing so, we first observe that for any ,

is an unbiased estimator of

, i.e., . Furthermore, note that the injected noise is independent from the stochasticity of gradient estimate . Hence, for all ,

(16)

As admits non-trivial minima, we know that . Accordingly, there exists and . Without loss of generality, we can always take and such that

where is the constant defined in Assumption 3. Thus, using Assumption 3 we get

(17)

Furthermore, thanks to (16) we know that

(18)

Finally, using (17) and (18) we obtain that

The above concludes the proof. ∎

C.4 Proof of Theorem 2

Before we prove the theorem, we note the following implication of Assumption 2.

Lemma 3.

Under Assumption 2, for a given parameter ,

where recall that is a batch of data points chosen randomly from dataset .

Proof.

Consider an arbitrary . Then,

By triangle inequality, and the fact that is a convex function, we obtain that

Recall that is a set of points randomly sampled without replacement from , which we denote by . Thus, given ,

Therefore, from above we obtain that

(19)

Note that

Finally, substituting above from Assumption 2 we obtain that

Substitution from above in (19) concludes the proof. ∎

We now present the proof of Theorem 2, which is re-stated below for convenience.

Theorem 2.

Let and . Consider Algorithm 1 with , a GAR satisfying the -approximated VN condition, and for all . If Assumptions 1, 2, and 3 hold true, then there exists and such that for any ,