The Skellam Mechanism for Differentially Private Federated Learning

by   Naman Agarwal, et al.
Carnegie Mellon University

We introduce the multi-dimensional Skellam mechanism, a discrete differential privacy mechanism based on the difference of two independent Poisson random variables. To quantify its privacy guarantees, we analyze the privacy loss distribution via a numerical evaluation and provide a sharp bound on the Rényi divergence between two shifted Skellam distributions. While useful in both centralized and distributed privacy applications, we investigate how it can be applied in the context of federated learning with secure aggregation under communication constraints. Our theoretical findings and extensive experimental evaluations demonstrate that the Skellam mechanism provides the same privacy-accuracy trade-offs as the continuous Gaussian mechanism, even when the precision is low. More importantly, Skellam is closed under summation and sampling from it only requires sampling from a Poisson distribution – an efficient routine that ships with all machine learning and data analysis software packages. These features, along with its discrete nature and competitive privacy-accuracy trade-offs, make it an attractive practical alternative to the newly introduced discrete Gaussian mechanism.



1 Introduction

The Gaussian mechanism is the workhorse for a multitude of differentially private learning algorithms [46, 10, 1]. While simple enough for mathematical reasoning and privacy accounting analyses, its continuous nature presents a number of challenges in practice. For example, it cannot be exactly represented on finite computers, making it prone to numerical errors that can break its privacy guarantees [35]. Moreover, it cannot be used in distributed learning settings with cryptographic multi-party computation primitives involving modular arithmetic, such as secure aggregation [12, 11]. To address these shortcomings, the binomial and (distributed) discrete Gaussian mechanisms were recently introduced [19, 2, 15, 25]. Unfortunately, both have their own drawbacks: the privacy loss for the binomial mechanism can be infinite with a non-zero probability, and the discrete Gaussian: (a) is not closed under summation (i.e. sum of discrete Gaussians is not a discrete Gaussian), complicating analysis in distributed settings and leading to a performance worse than continuous Gaussian in the highly distributed, low-noise regime [25]; (b) requires a sampling algorithm that is not shipped with mainstream machine learning or data analysis software packages, making it difficult for engineers to use it in production settings (naïve implementations may lead to catastrophic privacy errors).

Our contributions   To overcome these limitations, we introduce and analyze the multi-dimensional Skellam mechanism, a mechanism based on adding noise distributed according to the difference of two independent Poisson random variables. The Skellam noise is closed under summation (i.e., sums of Skellam random variables are again Skellam distributed) and is easy to sample from, since efficient Poisson samplers are widely available in numerical software packages. Being discrete in nature also means that it meshes well with cryptographic protocols and can lead to communication savings.

To analyze the privacy guarantees of the Skellam mechanism and compare it with other mechanisms, we provide a numerical evaluation of the privacy loss random variable and prove a sharp bound on the Rényi divergence between two shifted Skellam distributions. Our careful analysis shows that for a multi-dimensional query function with $\ell_1$ sensitivity $\Delta_1$ and $\ell_2$ sensitivity $\Delta_2$, the Skellam mechanism with variance $\mu$ achieves $(\alpha, \varepsilon(\alpha))$ Rényi differential privacy (RDP) [36] for

$$\varepsilon(\alpha) \le \frac{\alpha \Delta_2^2}{2\mu} + \min\left(\frac{(2\alpha-1)\Delta_2^2 + 6\Delta_1}{4\mu^2}, \frac{3\Delta_1}{2\mu}\right)$$

(see Theorem 3.5). This implies that the RDP guarantees are at most $1 + O(1/\mu)$ times worse than those of the Gaussian mechanism.

To analyze the performance of the Skellam mechanism in practice, we consider a differentially private and communication constrained federated learning (FL) setting [26] where the noise is added locally to the $d$-dimensional discretized client updates that are then summed securely via a cryptographic protocol, such as secure aggregation (SecAgg) [11, 12]. We provide an end-to-end algorithm that appropriately discretizes the data and applies the Skellam mechanism along with modular arithmetic to bound the range of the data and communication costs before applying SecAgg.

We show on distributed mean estimation and two benchmark FL datasets, Federated EMNIST [14] and Stack Overflow [8], that our method can match the performance of the continuous Gaussian baseline under tight privacy and communication budgets, despite using generic RDP amplification via sampling [51] for our approach and the precise RDP analysis for the subsampled Gaussian mechanism [37]. Our method is implemented in TensorFlow Privacy [32] and TensorFlow Federated [24] and will be open-sourced. [Footnote 1: While we mostly focus on FL applications, the Skellam mechanism can also be applied in other contexts of learning and analytics, including centralized settings.]

Related work   The Skellam mechanism was first introduced in the context of computational differential privacy from lattice-based cryptography [49] and private Bayesian inference [45]. However, the privacy analyses in the prior work do not readily extend to the multi-dimensional case, and they give direct bounds for pure or approximate DP which makes only advanced composition theorems [28, 22] directly applicable in learning settings where the mechanism is applied many times. For example, the guarantees from [49] lead to poor accuracy-privacy trade-offs as demonstrated in Fig. 1. Moreover, we show in Section 3.1 that extending the direct privacy analysis to the multi-dimensional setting is non-trivial because the worst-case neighboring dataset pair is unknown in this case. For these reasons, our tight privacy analysis via a sharp RDP bound makes the Skellam mechanism practical for learning applications for the first time. These guarantees (almost) match those of the Gaussian mechanism and allow us to use generic RDP amplification via subsampling methods [51].

The closest mechanisms to Skellam are the binomial [2, 19] and the discrete Gaussian mechanisms [15, 25]. The binomial mechanism can (asymptotically) match the continuous Gaussian mechanism (when properly scaled). However, it does not achieve Rényi or zero-concentrated DP [36, 13] and has a privacy loss that can be infinite with a non-zero probability, leading to catastrophic privacy failures. The discrete Gaussian mechanism yields Rényi DP and can be applied to distributed settings [25], but it requires a sampling algorithm that is not yet available in data analysis software packages despite being explored in the lattice-based cryptography community (e.g., [43, 18, 38]). The discrete Gaussian is also not closed under summation and the divergence can be large in highly distributed low-noise settings (e.g. quantile estimation [6] and federated analytics [42]), which causes privacy degradation. See the end of Section 4 for more discussion.

2 Preliminaries

We begin by providing a formal definition for $(\varepsilon, \delta)$-differential privacy (DP) [20].

Definition 2.1 (Differential Privacy).

For $\varepsilon, \delta \ge 0$, a randomized mechanism $M$ satisfies $(\varepsilon, \delta)$-DP if for all neighboring datasets $D, D'$ and all $S$ in the range of $M$, we have that

$$\Pr(M(D) \in S) \le e^{\varepsilon} \Pr(M(D') \in S) + \delta,$$

where $D$ and $D'$ are neighboring pairs if they can be obtained from each other by adding or removing all the records that belong to a particular user.

In our experiments we consider user-level differential privacy, i.e., $D$ and $D'$ are neighboring pairs if one of them can be obtained from the other by adding or removing all the records associated with a single user [33]. This is stronger than the commonly used notion of item-level privacy where, if a user contributes multiple records, only the addition or removal of one record is protected.

We also make use of Rényi differential privacy (RDP) [36] which allows for tight privacy accounting.

Definition 2.2 (Rényi Differential Privacy).

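A mechanism $M$ satisfies $(\alpha, \varepsilon)$-RDP if for any two neighboring datasets $D, D'$, we have that $D_\alpha(M(D) \,\|\, M(D')) \le \varepsilon$, where $D_\alpha(P \,\|\, Q)$ is the Rényi divergence between $P$ and $Q$ and is given by

$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right].$$

A closely related privacy notion is zero-concentrated DP (zCDP) [21, 13]. In fact, $\rho$-zCDP is equivalent to simultaneously satisfying an infinite family of RDP guarantees, namely $(\alpha, \rho\alpha)$-Rényi differential privacy for all $\alpha \in (1, \infty)$. The following conversion lemma from [13, 15, 7] relates RDP to $(\varepsilon, \delta)$-DP.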

Lemma 2.3.

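If $M$ satisfies $(\alpha, \varepsilon(\alpha))$-RDP, then, for any $\delta > 0$, $M$ satisfies $(\varepsilon_{\mathrm{DP}}(\delta), \delta)$-DP, where

$$\varepsilon_{\mathrm{DP}}(\delta) = \inf_{\alpha > 1} \; \varepsilon(\alpha) + \frac{\log(1/(\alpha\delta))}{\alpha - 1} + \log(1 - 1/\alpha).$$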
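As a minimal sketch of how such a conversion is used for accounting, the snippet below minimizes over a finite grid of orders $\alpha$. It assumes the conversion $\varepsilon_{\mathrm{DP}}(\delta) = \inf_{\alpha > 1} \varepsilon(\alpha) + \log(1/(\alpha\delta))/(\alpha - 1) + \log(1 - 1/\alpha)$, which we take to be the form proved in [15]; the function names, the grid of orders, and the example parameters are illustrative assumptions rather than part of the original text.

```python
import math

def rdp_to_dp(rdp_curve, delta, alphas):
    """Best epsilon over a grid of orders, using the (assumed) conversion
    eps = eps(alpha) + log(1 / (alpha * delta)) / (alpha - 1) + log(1 - 1 / alpha)."""
    best = math.inf
    for a in alphas:
        eps = (rdp_curve(a)
               + math.log(1.0 / (a * delta)) / (a - 1.0)
               + math.log(1.0 - 1.0 / a))
        best = min(best, eps)
    return max(best, 0.0)

def rdp_to_dp_classical(rdp_curve, delta, alphas):
    """Looser textbook conversion eps(alpha) + log(1 / delta) / (alpha - 1), for comparison."""
    return min(rdp_curve(a) + math.log(1.0 / delta) / (a - 1.0) for a in alphas)

def gaussian_rdp(alpha, sensitivity=1.0, sigma=2.0):
    """RDP curve of the Gaussian mechanism: alpha * Delta^2 / (2 sigma^2)."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)

alphas = [1.0 + i / 10.0 for i in range(1, 1000)]   # orders in (1, 101)
delta = 1e-5
eps_tight = rdp_to_dp(gaussian_rdp, delta, alphas)
eps_classical = rdp_to_dp_classical(gaussian_rdp, delta, alphas)
print(eps_tight, eps_classical)
```

Any RDP curve can be passed in place of `gaussian_rdp`; a finite grid only approximates the infimum, so the returned value is a valid but possibly slightly loose $\varepsilon$.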

For any query function $f$, we define the $\ell_p$ sensitivity as $\Delta_p = \max_{D, D'} \|f(D) - f(D')\|_p$, where $D$ and $D'$ are neighboring pairs differing by adding or removing all the records from a particular user. We also include the RDP guarantees of the discrete Gaussian mechanism (same RDP guarantees as the continuous Gaussian mechanism) to which we compare our method.

Definition 2.4 (The Discrete Gaussian Mechanism [15]).

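Given an integer-valued query $f(x) \in \mathbb{Z}^d$ and noise variance $\sigma^2$, the Discrete Gaussian (DGaussian) Mechanism is given by $f(x) + Z$, where each coordinate of $Z \in \mathbb{Z}^d$ is drawn independently from $\mathcal{N}_{\mathbb{Z}}(0, \sigma^2)$, and $\mathcal{N}_{\mathbb{Z}}(0, \sigma^2)$ denotes the discrete Gaussian distribution defined in Equation (1) of [15]. The discrete Gaussian mechanism achieves $(\alpha, \alpha\Delta_2^2/(2\sigma^2))$-Rényi DP.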

3 The Skellam Mechanism

We begin by presenting the definition of the Skellam distribution, which is the basis of the Skellam Mechanism for releasing integer ranged multi-dimensional queries.

Definition 3.1 (Skellam Distribution).

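The multidimensional Skellam distribution $\mathrm{Sk}_{\Delta,\mu}$ over $\mathbb{Z}^d$ with mean $\Delta \in \mathbb{Z}^d$ and variance $\mu$ is given with each coordinate $X_i$ distributed independently as

$$X_i \sim \mathrm{Sk}_{\Delta_i,\mu} \quad \text{with} \quad \Pr(X_i = k) = e^{-\mu} I_{k - \Delta_i}(\mu)$$

for $k \in \mathbb{Z}$. Here, $I_\nu(x)$ is the modified Bessel function of the first kind. A key property of Skellam random variables which motivates their use in DP is that they are closed under summation, i.e., if $X_1 \sim \mathrm{Sk}_{\Delta_1,\mu_1}$ and $X_2 \sim \mathrm{Sk}_{\Delta_2,\mu_2}$ are independent, then $X_1 + X_2 \sim \mathrm{Sk}_{\Delta_1+\Delta_2,\,\mu_1+\mu_2}$. This follows from the fact that a (zero-mean) Skellam random variable can be obtained by taking the difference between two independent Poisson random variables with means $\mu/2$. [Footnote 2: We only consider the symmetric version of Skellam, but it is often more generally defined as the difference of independent Poisson random variables with different variances.] We are now ready to introduce the Skellam Mechanism.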

Definition 3.2 (The Skellam Mechanism).

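Given an integer-valued query $f(x) \in \mathbb{Z}^d$, we define the Skellam Mechanism as

$$\mathrm{Sk}_{\mu}(f(x)) = f(x) + Z, \quad \text{where } Z \sim \mathrm{Sk}_{0,\mu},$$

and the total error of the mechanism is bounded by $d\mu$ (in expected squared $\ell_2$ norm, since each of the $d$ coordinates of $Z$ has variance $\mu$).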
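The sampling route and the closure property described above can be sketched in a few lines. This is an illustrative example using NumPy's Poisson sampler, not the authors' TensorFlow implementation; the helper names `sample_skellam` and `skellam_mechanism` and all parameter values are assumptions made for the example.

```python
import numpy as np

def sample_skellam(mu, size, rng):
    """Zero-mean Skellam noise with variance mu: difference of two Poisson(mu / 2) draws."""
    return rng.poisson(mu / 2.0, size) - rng.poisson(mu / 2.0, size)

def skellam_mechanism(query, mu, rng):
    """Release an integer-valued query with independent Skellam noise on each coordinate."""
    query = np.asarray(query, dtype=np.int64)
    return query + sample_skellam(mu, query.shape, rng)

rng = np.random.default_rng(0)
mu, n_shares, n_samples = 8.0, 4, 200_000

# One central draw with variance mu ...
central = sample_skellam(mu, n_samples, rng)
# ... versus the sum of n_shares independent shares, each with variance mu / n_shares.
summed = sample_skellam(mu / n_shares, (n_shares, n_samples), rng).sum(axis=0)

print("central mean/var:", central.mean(), central.var())
print("summed  mean/var:", summed.mean(), summed.var())
print("noisy query:", skellam_mechanism([3, -1, 0, 7], mu, rng))
```

Both empirical variances should come out close to `mu`, which is the behavior that the closure-under-summation property predicts for noise split across several parties.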

The Skellam mechanism was first introduced in [49] for the scalar case. As our goal is to apply the Skellam mechanism in the learning context, we have to address the following challenges. (1) Tight privacy compositions: Learning algorithms are iterative in nature and require the application of the DP mechanism many times (often ). The current direct approximate DP analysis in [49] can be combined with advanced composition (AC) theorems [28, 22] but that leads to poor privacy-accuracy trade-offs (see Fig. 1).

Figure 1: Comparing privacy compositions across various mechanisms and accounting methods.

(2) Privacy analysis for multi-dimensional queries: In learning algorithms, the differentially private queries are multi-dimensional, where the dimension $d$ equals the number of model parameters. Using composition theorems leads to poor accuracy-privacy trade-offs, and a direct extension of the approximate DP guarantee [49] to the multi-dimensional case leads to a strong dependence on the $\ell_1$ sensitivity, which is prohibitively large in high dimensions. (3) Data discretization: The gradients are naturally continuous vectors but we would like to apply an integer-based mechanism. This requires properly discretizing the data while making sure that the norm of the vectors (sensitivity of the query) is preserved. We will tackle challenges (1) and (2) in the remainder of this section and leave (3) for the next section.

3.1 Tight Numerical Accounting via Privacy Loss Distributions

We begin by defining the notion of privacy loss distributions (PLDs).

Definition 3.3 (Privacy Loss Distribution).

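For a multi-dimensional discrete privacy mechanism $M$ and neighboring datasets $D, D'$, for any $x$ in the range of $M$, we define $f_{D,D'}(x) = \log\left(\Pr(M(D) = x) / \Pr(M(D') = x)\right)$. The privacy loss random variable of $M$ at $(D, D')$ is $Z_{D,D'} = f_{D,D'}(M(D))$ [22]. The privacy loss distribution (PLD) of $M$, denoted by $\mathrm{PLD}_{D,D'}$, is the distribution of $Z_{D,D'}$.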

The PLD of a mechanism can be used to characterize its -DP guarantees.

Lemma 3.4.

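A mechanism $M$ is $(\varepsilon, \delta)$-DP if and only if $\delta \ge \mathbb{E}_{Z \sim \mathrm{PLD}_{D,D'}}\left[(1 - e^{\varepsilon - Z})_+\right]$ for all neighboring datasets $D, D'$, where $(x)_+ = \max(0, x)$.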

When a mechanism is applied $T$ times on a dataset, the overall PLD of the composed mechanism at $(D, D')$ is the $T$-fold convolution of $\mathrm{PLD}_{D,D'}$ [22]. Since discrete convolutions can be computed efficiently using fast Fourier transforms (FFTs) and the expectation in Lemma 3.4 can be numerically approximated, PLDs are attractive for tight numerical accounting [30, 34, 17]. Applying the above to the Skellam mechanism, a direct calculation shows that, with $X_1, \dots, X_d$ i.i.d. according to $\mathrm{Sk}_{0,\mu}$ and $\Delta = f(D') - f(D) \in \mathbb{Z}^d$,

$$Z_{D,D'} = \sum_{i=1}^{d} \log\left(\frac{I_{X_i}(\mu)}{I_{X_i - \Delta_i}(\mu)}\right).$$

When $d = 1$, it suffices to look at $Z = \log\left(I_{X}(\mu) / I_{X - \Delta}(\mu)\right)$, where $X \sim \mathrm{Sk}_{0,\mu}$ and $\Delta \in \mathbb{Z}$ is the shift between the two query outputs. Since $X$ has a discrete and symmetric probability distribution and the privacy loss is a monotonic function of $X$, the distribution of $Z$ can be easily characterized. This gives us a tight numerical accountant for the Skellam mechanism in the scalar case, which we use to compare it with both the Gaussian and discrete Gaussian mechanisms. Fig. 1 shows this comparison, highlighting the competitiveness of the Skellam mechanism and the problem of combining the direct analysis of [49] with advanced composition (AC) theorems. When $d > 1$, there are combinatorially many $\Delta$'s that need to be considered, even when the sensitivity of $f$ is bounded. The discrete Gaussian mechanism faces a similar issue (see Theorem 15 of [15]). To provide a tight privacy analysis in the multi-dimensional case, we prove a bound on the RDP guarantees of the Skellam mechanism in the next subsection. Figs. 1 and 2 show that our bound is tight and that the Skellam mechanism is competitive in high dimensions.
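To make the scalar PLD accountant concrete, here is a rough sketch that builds the privacy loss distribution of a scalar Skellam mechanism with an integer shift, composes it by repeated convolution, and evaluates $\delta(\varepsilon)$ via the expectation in Lemma 3.4. Several choices are assumptions made for illustration and are not taken from the paper: the truncation `half_width`, the grid `step` with losses rounded up (a pessimistic discretization), and the use of `np.convolve` rather than an FFT.

```python
import math
import numpy as np

def skellam_pmf(mu, half_width):
    """PMF of a zero-mean Skellam with variance mu on {-half_width, ..., half_width}."""
    lam = mu / 2.0
    j = np.arange(half_width + 1)
    log_pois = j * math.log(lam) - lam - np.array([math.lgamma(v + 1.0) for v in j])
    pois = np.exp(log_pois)
    # X = N1 - N2, so the PMF of X is the cross-correlation of the two Poisson PMFs.
    return np.convolve(pois, pois[::-1])

def scalar_skellam_pld(mu, shift, step, half_width=60):
    """Histogram of the loss log(P(X = k) / P(X = k - shift)), rounded up to the grid."""
    pmf = skellam_pmf(mu, half_width)
    p, q = pmf[shift:], pmf[:-shift]          # P(X = k) and P(X = k - shift) for the same k
    idx = np.ceil(np.log(p / q) / step).astype(int)
    lo = int(idx.min())
    hist = np.zeros(int(idx.max()) - lo + 1)
    np.add.at(hist, idx - lo, p)
    return lo, hist

def compose(lo, hist, times):
    """PLD of `times` independent applications: repeated convolution of the histogram."""
    out_lo, out = 0, np.array([1.0])
    for _ in range(times):
        out = np.convolve(out, hist)
        out_lo += lo
    return out_lo, out

def delta_at(lo, hist, step, eps):
    """delta(eps) = E[(1 - exp(eps - Z))_+] over the discretized loss Z."""
    losses = (lo + np.arange(len(hist))) * step
    gap = np.minimum(eps - losses, 0.0)
    return float(np.sum(hist * (-np.expm1(gap))))

step = 0.01
lo, hist = scalar_skellam_pld(mu=4.0, shift=1, step=step)
lo_T, hist_T = compose(lo, hist, times=8)
delta_single = delta_at(lo, hist, step, eps=1.0)
delta_composed = delta_at(lo_T, hist_T, step, eps=1.0)
print(delta_single, delta_composed)
```

Rounding losses up keeps the reported $\delta$ on the conservative side at the cost of some slack; a finer `step` tightens it while making the convolutions more expensive.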

3.2 Tight Accounting via Rényi Differential Privacy

The following theorem states our main theoretical result, providing a relatively sharp bound on the RDP properties of the Skellam mechanism.

Theorem 3.5.

For $\alpha \in \mathbb{Z}$, $\alpha > 1$, and sensitivity $\Delta \in \mathbb{Z}$, the Skellam Mechanism is $(\alpha, \varepsilon(\alpha))$-RDP with

$$\varepsilon(\alpha) \le \frac{\alpha\Delta^2}{2\mu} + \min\left(\frac{(2\alpha - 1)\Delta^2 + 6\Delta}{4\mu^2}, \frac{3\Delta}{2\mu}\right).$$

For comparison, recall that the Gaussian mechanism is $(\alpha, \varepsilon(\alpha))$-RDP with $\varepsilon(\alpha) = \alpha\Delta^2/(2\sigma^2)$. The bound we provide is at most $1 + O(1/\mu)$ times worse than the bound for the Gaussian, which is negligible for all practical choices of $\mu$, especially as the privacy requirements increase. [Footnote 3: The restriction that $\alpha$ needs to be an integer is a technical one owing to known bounds on Bessel functions. In practice, as we show, this restriction has a negligible effect.] Next we show a simple corollary which follows via the independent composition of RDP across dimensions.
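As a numerical sanity check of the comparison with the Gaussian, the sketch below evaluates the Rényi divergence between a scalar Skellam distribution and its integer shift directly from the probability mass function, and prints it next to the Gaussian value $\alpha\Delta^2/(2\mu)$ with $\sigma^2 = \mu$. The truncation `half_width`, the helper names, and the parameter values are assumptions chosen for illustration; this is not the accounting code used in the paper.

```python
import math
import numpy as np

def skellam_pmf(mu, half_width):
    """PMF of a zero-mean Skellam with variance mu on {-half_width, ..., half_width}."""
    lam = mu / 2.0
    j = np.arange(half_width + 1)
    log_pois = j * math.log(lam) - lam - np.array([math.lgamma(v + 1.0) for v in j])
    pois = np.exp(log_pois)
    # Difference of two independent Poisson(lam) variables: cross-correlate their PMFs.
    return np.convolve(pois, pois[::-1])

def renyi_divergence_shifted(mu, shift, alpha, half_width=400):
    """D_alpha(P || Q) with P = law of X and Q = law of X + shift (shift >= 1 integer)."""
    pmf = skellam_pmf(mu, half_width)
    p, q = pmf[shift:], pmf[:-shift]          # P(k) and Q(k) = P(k - shift) for the same k
    mask = (p > 0) & (q > 0)
    log_terms = alpha * np.log(p[mask]) + (1.0 - alpha) * np.log(q[mask])
    return float(np.log(np.exp(log_terms).sum()) / (alpha - 1.0))

mu, shift = 100.0, 1
results = {}
for alpha in (2, 8, 32):
    skellam = renyi_divergence_shifted(mu, shift, alpha)
    gaussian = alpha * shift ** 2 / (2.0 * mu)
    results[alpha] = (skellam, gaussian)
    print(alpha, skellam, gaussian)
```

For a variance this large the two columns should agree closely, consistent with the gap to the Gaussian shrinking as $\mu$ grows. The default `half_width` is only adequate for moderate values of `mu` and would need to be raised for much larger variances.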

Corollary 3.6.

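The multi-dimensional Skellam Mechanism is $(\alpha, \varepsilon(\alpha))$-RDP with

$$\varepsilon(\alpha) \le \frac{\alpha\Delta_2^2}{2\mu} + \min\left(\frac{(2\alpha - 1)\Delta_2^2 + 6\Delta_1}{4\mu^2}, \frac{3\Delta_1}{2\mu}\right),$$

where $\Delta_1$ and $\Delta_2$ are the $\ell_1$ and $\ell_2$ sensitivities respectively.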

3.2.1 Proof Overview for Theorem 3.5

In this subsection, we provide the proof of Theorem 3.5 assuming a technical bound on the ratios of Bessel functions presented as Lemma 3.7, which is the core of our analysis and may be of independent interest. We provide a proof overview for Lemma 3.7, deferring the full proof to the appendix.

On a macroscopic level, our proof structure mimics the RDP proof for the Gaussian mechanism [36], and the main object of our interest is to bound the following quantity, defined for any :


The following lemma states our main bound on this quantity.

Lemma 3.7.

For any , with and , we have that for all

Note that in contrast if we consider the analogous notion of for the Gaussian mechanism (replacing with the Gaussian density ), we readily get the bound , which is the same as our bound up to lower order terms. We now provide the proof of Theorem 3.5.

Proof of Theorem 3.5.

By RDP definition (2.2), we need to bound the following for any , ,

Now consider the following calculations on the term:

where the inequality follows from Lemma 3.7. ∎

We now provide an overview for the proof of Lemma 3.7 highlighting the crux of the argument. As a first step we collect some known facts regarding Bessel functions. It is known that for and , , is a decreasing function in , and is an increasing function in [47]. A succession of works considers bounding the ratio of successive Bessel functions , which is a natural quantity to consider given the objective in Lemma 3.7. We use the following very tight characterization of this ratio, recently proved in [44, Theorem 5].

Lemma 3.8.

For any define the following function we have that

where is defined as .

Standard bounds such as those appearing in [5, 49] lead to the following conclusion:

While the above bound is significantly easier to work with, it leads to an RDP guarantee of Gaussian RDP + . In high dimensions this manifests as and overall leads to a constant multiplicative factor over the Gaussian. On the other hand we prove a Gaussian RDP + bound. Our proof of Lemma 3.7 splits into various cases depending on the signs of the quantities involved. We show the derivation for a single case below and defer the full proof to the appendix.

Proof of Lemma 3.7 in the case , .

Replacing we get that

where the first inequality follows from Lemma 3.8 and the fact that for all , and the second inequality follows from Lemma A.1 (provided in the appendix):

Figure 2: Benchmarking Skellam on sensitivity-1 queries under various accounting methods. RDP: Rényi DP. PLD: privacy loss distributions. Skellam (Direct): [49]. Gaussian (Analytic): [9]. DGaussian [15] / DDGauss [25]: central / distributed discrete Gaussian. is the scaling factor applied to both and . For Skellam and DDGauss [25], the central noise with std is split into shares each applied locally with std ; a large and small can thus exacerbate the sum divergence term of DDGauss (left). Left: . Right: .

4 Applying the Skellam Mechanism to Federated Learning

With a sharp RDP analysis for the multi-dimensional Skellam mechanism presented in the previous section, we are now ready to apply it to differentially private federated learning. We first outline the general problem setting and then describe our approach under central and distributed DP models.

Problem setting At a high level, we consider the distributed mean estimation problem. There are $n$ clients each holding a vector $x_i$ in $\mathbb{R}^d$ such that for all $i$, the vector norm is bounded as $\|x_i\|_2 \le c$ for some $c \ge 0$. We denote the set of vectors as $X = \{x_i\}_{i=1}^{n}$, and the aim is for each client to communicate the vectors to a central server which then aggregates them as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for an external analyst. In federated learning, the client vectors are the model gradients or model deltas after training on the clients’ local datasets, and this procedure can be repeated for many rounds $T$. A large $d$ and $T$ thus necessitate accounting methods that provide tight privacy compositions for high-dimensional queries.

We are primarily concerned with three metrics for this procedure and their trade-offs: (1) Privacy: the mean should be differentially private with a reasonably small $\varepsilon$; (2) Error: we wish to minimize the expected error; and (3) Communication: we wish to minimize the average number of bits communicated per coordinate. Characterizing this trade-off is an important research problem. For example, it has been recently shown [50] that without formal privacy guarantees, the client training data could still be revealed by the model updates; on the other hand, applying differential privacy [48] to these updates can degrade the final utility.

Figure 3: Comparing Skellam and Distributed Discrete Gaussian (DDGauss) on multi-dimensional real-valued queries, rounded to integers with (Prop. 4.2). is the scaling applied to both and ; a larger reduces the rounding error and norm inflation. is the sampling rate. For Skellam and DDGauss [25], the central noise with std is split into shares each applied locally with std ; a large and small can exacerbate the sum divergence term of DDGauss. Left: Simple setting with . Right: FL-like setting for training CNNs on Federated EMNIST.

Skellam for central DP The central DP model refers to adding Skellam noise onto the non-private aggregate before releasing it to the external analyst. One important consideration is that the model updates in FL are continuous in nature, while Skellam is a discrete probability distribution. One approach is to appropriately discretize the client updates, e.g., via uniform quantization (which involves scaling the inputs by a factor $s$ chosen for some bit-width $b$, followed by stochastic rounding for unbiased estimates), and the server can convert the private aggregate back to real numbers at the end. [Footnote 4: Example of stochastic rounding: 42.3 has 0.7 and 0.3 probability to be rounded to 42 and 43, respectively. Other discretization schemes are possible; we do not explore this direction further in this work.] Note that this allows us to re-parameterize the variance of the added Skellam noise as $s^2\mu$, giving the following simple corollary based on Cor. 3.6:

Corollary 4.1 (Scaled Skellam Mechanism).

With a scaling factor $s$ and Skellam noise of variance $s^2\mu$, the multi-dimensional Skellam Mechanism is $(\alpha, \varepsilon(\alpha))$-RDP with

$$\varepsilon(\alpha) \le \frac{\alpha\Delta_2^2}{2\mu} + \min\left(\frac{(2\alpha - 1)\Delta_2^2}{4s^2\mu^2} + \frac{3\Delta_1}{2s^3\mu^2}, \frac{3\Delta_1}{2s\mu}\right).$$

As $s$ increases, the RDP of scaled Skellam rapidly approaches that of Gaussian as the second term above approaches 0, suggesting that under practical regimes with moderate compression bit-width, Skellam should perform competitively compared to Gaussian. Another aspect worth noting is that rounding vector coordinates from reals to integers can inflate the $\ell_2$-sensitivity $\Delta_2$, and thus more noise is required for the same privacy. To this end, we leverage the conditional rounding procedure introduced in [25] to obtain a bounded norm on the scaled and rounded client vector:

Proposition 4.2 (Norm of stochastically rounded vector [25]).

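Let $\tilde{x}$ be a stochastic rounding of vector $x \in \mathbb{R}^d$ to the integer grid $\mathbb{Z}^d$. Then, for $\beta \in (0, 1)$, we have

$$\Pr\left[\|\tilde{x}\|_2^2 \le \|x\|_2^2 + \frac{d}{4} + \sqrt{2\log(1/\beta)}\left(\|x\|_2 + \frac{\sqrt{d}}{2}\right)\right] \ge 1 - \beta.$$

Conditional rounding is thus defined as retrying the stochastic rounding on $x$ until $\|\tilde{x}\|_2^2$ is within the probabilistic bound above (which also gives the inflated sensitivity $\tilde{\Delta}_2$). We can then add Skellam noise to the aggregate according to $\tilde{\Delta}_2$ before undoing the quantization (unscaling). Note that a larger scaling $s$ before rounding reduces the norm inflation and the extra noise needed (Fig. 3 right).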
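The retry loop behind conditional rounding can be sketched as follows. The squared-norm threshold used here, $\|x\|_2^2 + d/4 + \sqrt{2\log(1/\beta)}\,(\|x\|_2 + \sqrt{d}/2)$, is our reading of the bound from [25] and should be treated as an assumption; the function names, the retry cap, and the example values (including the choice of $\beta$) are likewise illustrative.

```python
import math
import numpy as np

def stochastic_round(x, rng):
    """Round each coordinate up with probability equal to its fractional part (unbiased)."""
    floor = np.floor(x)
    return (floor + (rng.random(x.shape) < (x - floor))).astype(np.int64)

def rounded_norm_bound_sq(l2_norm, dim, beta):
    """Assumed squared-norm threshold; a single rounding satisfies it w.p. >= 1 - beta."""
    return (l2_norm ** 2 + dim / 4.0
            + math.sqrt(2.0 * math.log(1.0 / beta)) * (l2_norm + math.sqrt(dim) / 2.0))

def conditional_round(x, beta, rng, max_tries=1000):
    """Retry stochastic rounding until the rounded vector meets the norm threshold."""
    bound_sq = rounded_norm_bound_sq(float(np.linalg.norm(x)), x.size, beta)
    for _ in range(max_tries):
        rounded = stochastic_round(x, rng)
        if float(np.sum(rounded.astype(np.float64) ** 2)) <= bound_sq:
            return rounded, bound_sq
    raise RuntimeError("conditional rounding did not terminate")

rng = np.random.default_rng(0)
x = 20.0 * rng.standard_normal(256)       # stands in for a scaled, real-valued client vector
beta = math.exp(-0.5)                     # illustrative value only
rounded, bound_sq = conditional_round(x, beta, rng)
print(float(np.linalg.norm(x) ** 2), float(np.sum(rounded ** 2)), bound_sq)

# The footnote's example: 42.3 rounds to 42 or 43 and is unbiased on average.
samples = stochastic_round(np.full(100_000, 42.3), rng)
print(samples.mean())
```

The square root of `bound_sq` plays the role of the inflated $\ell_2$ sensitivity to which the noise is calibrated. Retrying makes the rounding slightly biased, and a smaller $\beta$ (fewer retries, less bias) comes at the price of a larger threshold.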

Skellam for distributed DP with secure aggregation A stronger notion of privacy in FL can be obtained via the distributed DP model [25] that leverages secure aggregation (SecAgg [12]). The fact that the Skellam distribution is closed under summation allows us to easily extend from central DP to distributed DP. Under this model, the client vectors are quantized as in the central DP model, but the Skellam noise is now added locally, with each of the $n$ clients contributing a $1/n$ share of the central noise variance. Then, the noisy client updates are summed via SecAgg ($b$ bits per coordinate for field size $2^b$), which only reveals the noisy aggregate to the server. While the local noise might be insufficient for local DP guarantees, the aggregated noise at the server provides privacy and utility comparable to the central DP model, thus removing the need to trust the central aggregator. Note that the modulo operations introduced by SecAgg do not impact privacy as they can be viewed as a post-processing of an already differentially private query.
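The interaction between local Skellam noise and modular aggregation can be illustrated with a toy simulation. The plain modular sum below is only a stand-in for SecAgg (there is no cryptography here), and the helper names, the per-client variance `mu / n`, and all numeric values are assumptions made for the example.

```python
import numpy as np

def to_field(v, bits):
    """Map signed integers into the field {0, ..., 2^bits - 1}."""
    return np.mod(v, 1 << bits)

def from_field(v, bits):
    """Map field elements back to the signed range [-2^(bits-1), 2^(bits-1))."""
    half = 1 << (bits - 1)
    return np.mod(v + half, 1 << bits) - half

def local_skellam(mu_local, size, rng):
    """Zero-mean Skellam noise with variance mu_local (difference of two Poisson draws)."""
    return rng.poisson(mu_local / 2.0, size) - rng.poisson(mu_local / 2.0, size)

rng = np.random.default_rng(0)
n, d, bits, mu = 50, 16, 12, 32.0

updates = rng.integers(-10, 11, size=(n, d))            # already-discretized client vectors
noisy = updates + local_skellam(mu / n, (n, d), rng)    # each client adds a 1/n variance share

# Stand-in for SecAgg: the server only ever sees the sum modulo 2^bits.
modular_sum = np.zeros(d, dtype=np.int64)
for y in noisy:
    modular_sum = np.mod(modular_sum + to_field(y, bits), 1 << bits)

aggregate = from_field(modular_sum, bits)
print(aggregate)
print(noisy.sum(axis=0))
```

As long as the true noisy sum stays inside the signed range of the field, unwrapping recovers it exactly; once it exceeds that range the result wraps around, which is why the scale and bit-width have to be chosen together.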

We remark on several properties of the distributed Skellam compared to the distributed discrete Gaussian (DDGauss [25]). (1) DDGauss is not closed under summation, and the divergence between discrete Gaussians can lead to notable privacy degradation in settings such as quantile estimation [6] and federated analytics [42] with a sufficiently large number of clients and small local noise (see also the left side of Fig. 2 and Fig. 3). While scaling mitigates this issue, it also requires additional bit-width, which makes Skellam attractive under tight communication constraints. (2) Sampling from Skellam only requires sampling from Poisson, for which efficient implementations are widely available in numerical software packages. While efficient discrete Gaussian sampling has also been explored in the lattice-based cryptography community (e.g., [43, 18, 38]), we believe the accessibility of Skellam samplers would help facilitate the deployment of DP to FL settings with mobile and edge devices. See Appendix D for more discussion. (3) In practice where (dictated by bit-width ), both Skellam (cf. Cor. 4.1) and DDGauss (with an exponentially small divergence) quickly approach Gaussian under RDP, and any differences will be negligible (Fig. 3).

5 Empirical Evaluation

In this section, we empirically evaluate the Skellam mechanism on two sets of experiments: distributed mean estimation and federated learning. In both cases, we focus on the distributed DP model, but note that the Skellam mechanism can be easily adapted to the central DP setting as discussed in the earlier section. Unless otherwise stated, we use RDP accounting for all experiments due to the high-dimensional data and the ease of composition (Section 3). To obtain $\Delta_1$ for Skellam RDP, we note that $\Delta_1 \le \min(\sqrt{d}\,\Delta_2, \Delta_2^2)$, since $\|x\|_1 \le \sqrt{d}\,\|x\|_2$ in general and $\|x\|_1 \le \|x\|_2^2$ for integers.

Under the distributed DP model, we also introduce a random orthogonal transformation [29, 2, 25] before discretizing and aggregating the client vectors (which can be reverted after the aggregation); this makes the vector coordinates sub-Gaussian and helps spread the magnitudes of the vector coordinates across all dimensions, thus reducing the errors from quantization and potential wrap-around from SecAgg modulo operations. Moreover, by approximating the sub-exponential tail of the Skellam distribution as sub-Gaussian, we can derive a heuristic for choosing $s$ following [25] based on a bound on the variance of the aggregated signal, as . We choose $s$ such that are bounded within the SecAgg field size $2^b$, where $k$ is a small constant.
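One common way to realize such a random orthogonal transformation is a randomized Hadamard transform: flip signs with a shared random diagonal matrix, then apply a normalized Walsh-Hadamard transform. The sketch below is an illustrative implementation with hypothetical helper names (`fwht`, `rotate`, `unrotate`, `pad_to_pow2`), not the code used for the experiments.

```python
import numpy as np

def pad_to_pow2(x):
    """Zero-pad a vector to the next power-of-2 length."""
    d = 1 << (x.size - 1).bit_length()
    return np.pad(x, (0, d - x.size))

def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    y = np.array(x, dtype=np.float64)
    n = y.size
    h = 1
    while h < n:
        y = y.reshape(-1, 2, h)
        # Butterfly step: (a, b) -> (a + b, a - b) on blocks of length h.
        y = np.stack((y[:, 0, :] + y[:, 1, :], y[:, 0, :] - y[:, 1, :]), axis=1)
        y = y.reshape(n)
        h *= 2
    return y / np.sqrt(n)

def rotate(x, signs):
    """Apply H D x, with D = diag(signs) and H the normalized Hadamard matrix."""
    return fwht(signs * x)

def unrotate(y, signs):
    """Invert the rotation: D H y, since the normalized H is symmetric and orthogonal."""
    return signs * fwht(y)

rng = np.random.default_rng(0)
x = np.zeros(100)
x[3] = 5.0                                          # a "spiky" vector
padded = pad_to_pow2(x)
signs = rng.choice([-1.0, 1.0], size=padded.size)   # shared randomness across clients
rotated = rotate(padded, signs)
restored = unrotate(rotated, signs)
print(np.abs(padded).max(), np.abs(rotated).max())
```

The rotation preserves the $\ell_2$ norm while flattening large coordinates, which is what reduces quantization error and the chance of modular wrap-around; the server undoes it after aggregation using the same signs.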

Algorithm 1 summarizes the aggregation procedure for the distributed Skellam mechanism via secure aggregation as well as the parameters used for the experiments. In summary, we have an $\ell_2$ clip norm $c$; per-coordinate bit-width $b$; target central noise variance $\mu$; number of clients $n$; signal bound multiplier $k$; and rounding bias $\beta$. We fix for all experiments. Note that the per-coordinate bit-width $b$ is for the aggregated sum as it determines the field size of SecAgg. For federated learning, we also consider the number of rounds $T$ and the total number of clients $N$ (thus the uniform sampling ratio $q = n/N$ at every round). Our experiments are implemented in Python, TensorFlow Privacy [32], and TensorFlow Federated [24]. See also Appendix for additional results and more details on the experimental setup.

  Inputs: Private vector for each client ; clip norm ; Bit-width ; Target central noise variance ; Number of clients ; Signal bound multiplier ; Bias .
  Shared randomness: diagonal matrix with uniformly random values, where is the nearest power of 2.
  Shared scale: Obtain scaling factor such that .
  Procedure ClientProcedure()
     Clip and scale vector , and pad to dimensions with zeros.
     Random rotation: where is the normalized Hadamard matrix.
     repeat {conditional stochastic rounding}
        Stochastically round the coordinates of to the integer grid to produce
     until .
     Local noising: Sample noise vector where each entry is sampled from .
     return under the SecAgg protocol with modulo bit-width .
  Procedure ServerProcedure() { is the modular sum of under bit-width }
     return , with .
Algorithm 1 Aggregation Procedure for the Distributed Skellam Mechanism

5.1 Distributed Mean Estimation (DME)

We first consider DME as the generalization of (single round) FL. We randomly generate client vectors from the -dimensional sphere with radius , and compute the true mean . We then compute the private estimate of with the distributed Skellam mechanism (Algorithm 1) as . For a strong baseline, we use the analytic Gaussian mechanism [9] with tight accounting (see also Figure 2). In Figure 4, we plot the MSE as with 95% confidence interval (small shaded region) over 10 dataset initializations across different values of , , and . Results demonstrate that Skellam can match Gaussian even with clients as long as the bit-width is sufficient. We emphasize that the communication cost depends logarithmically on , and to put numbers into context, Google’s production next-word prediction models [23, 39] use and the production DP language model [40] uses .

Figure 4: Distributed mean estimation with the distributed Skellam mechanism.

5.2 Federated Learning

Setup We evaluate on three public federated datasets with real-world characteristics: Federated EMNIST [16], Shakespeare [31, 14], and Stack Overflow next word prediction (SO-NWP [8]). EMNIST is an image classification dataset for hand-written digits and letters; Shakespeare is a text dataset for next-character-prediction based on the works of William Shakespeare; and SO-NWP is a large-scale text dataset for next-word-prediction based on user questions/answers from Stack Overflow. We emphasize that all datasets have natural client heterogeneity that is representative of practical FL problems: the images in EMNIST are grouped by the writer of the handwritten digits, the lines in Shakespeare are grouped by the speaking role, and the sentences in SO-NWP are grouped by the corresponding Stack Overflow user. We train a small CNN with model size for EMNIST and use the recurrent models defined in [41] for Shakespeare and SO-NWP. The hyperparameters for the experiments follow those from [25, 6, 27, 41] and tuning is limited. For EMNIST, we follow [25] and fix , , , client learning rate , server learning rate , and client batch size . For Shakespeare, we follow [6] and fix , , , , and , and we sweep . For SO-NWP, we follow [27] and fix , , , , and , and we sweep and limit max examples per client to 256. In all cases, clients train for 1 epoch on their local datasets, and the client updates are weighted uniformly (as opposed to weighting by number of examples). See Appendix for more results and full details on datasets, models, and hyperparameters.

Results Figure 5 summarizes the FL experiments. For EMNIST and Shakespeare, we report the average test accuracy over the last 100 rounds. For SO-NWP, we report the top-1 accuracy (without padding, out-of-vocab, or beginning/end-of-sentence tokens) on the test set. The results indicate that Skellam performs as well as Gaussian despite relying on generic RDP amplification via sampling [51] (cf. Fig. 3) and that Skellam matches DDG consistently under realistic regimes. This bears significant practical relevance given the advantages of Skellam over DDG in real-world deployments.

Figure 5: Federated learning with the distributed Skellam mechanism. DDGauss: Distributed Discrete Gaussian [25]. Left / Middle / Right: Test accuracies on EMNIST / Shakespeare / Stack Overflow NWP across different and . is set to , , , respectively. For Shakespeare, privacy is reported with a hypothetical population size .

6 Conclusion

We have introduced the multi-dimensional Skellam mechanism for federated learning. We analyzed the Skellam mechanism through the lens of approximate DP, privacy loss distributions, and Rényi divergences, and derived a sharp RDP bound that enables Skellam to match Gaussian and discrete Gaussian in practical settings as demonstrated by our large-scale experiments. Since Skellam is closed under summation and efficient samplers are widely available, it represents an attractive alternative to distributed discrete Gaussian as it easily extends from the central DP model to the distributed DP model. Being a discrete mechanism can also bring potential communication savings over continuous mechanisms and make Skellam less prone to attacks that exploit floating-point arithmetic on digital computers. Some interesting future work includes: (1) our scalar PLD analysis for Skellam suggests room for improvements on our multi-dimensional analysis via a complete PLD characterization, and (2) our results on FL may be further improved via a targeted analysis for RDP amplification via sampling akin to [37]. Overall, this work is situated within the active area of private machine learning and aims at making ML more trustworthy. One potential negative impact is that our method could be (deliberately or inadvertently) misused, such as sampling the wrong noise or using a minuscule scaling factor, to provide non-existent privacy guarantees for real users’ data. We nevertheless believe our results have positive impact as they facilitate the deployment of differential privacy in practice.


Appendix A Proof of Lemma 3.7

Before moving forward we state the following lemma which follows via a simple calculation.

Lemma A.1.

For any positive real number and any , we have that

Further for any positive reals we have that

where as defined in Lemma 3.8.


The first inequality follows easily from the definition of and by noting that the function . For the second inequality, by the definition of we have that

Now, consider the scalar function for . Note that the function is monotonically increasing, concave and has values between with . Putting these facts together we have that for any

Using Lemma 3.8 and Lemma A.1 we have the following Lemma

Lemma A.2.

Given two non-negative integers we have that

We are now ready to provide the proof of Lemma 3.7.

Proof of Lemma 3.7.

We prove the statement for , a similar analysis applies for the case by switching to .

Since Lemma 3.8 applies only in the case when is positive, we need to handle the negative case by noting that $I_{-\nu}(x) = I_{\nu}(x)$ for integer $\nu$. This necessitates treating multiple cases. We begin with the first case.

Case 1 -

In this case replacing setting we get that


where we know that . Now consider the following calculation.

where the first inequality follows from Lemma 3.8 and the second inequality follows from the fact that for all we have that and the third inequality follows from Lemma A.1.

Case 2 -

In this case replacing setting we get that


where we know that . Now consider the following calculation.

where the first inequality follows from Lemma 3.8 and the second inequality follows from the fact that for all we have that and the third inequality follows from Lemma A.1.

Case 3 -

In this case we first note that

Next, consider the following calculation, which corresponds to applying Lemma A.2 to the above expression: