I Introduction
The rise of data analytics and machine learning (ML) presents countless opportunities for companies, governments and individuals to benefit from the accumulated data. At the same time, their ability to capture fine levels of detail potentially compromises the privacy of data providers. Recent research [16, 33] suggests that even in a black-box setting it is possible to argue about the presence of individual records in the training set or to recover certain features of these records.
To tackle this problem, a number of solutions have been proposed. They vary in how privacy is achieved and to what extent data is protected. One approach that assumes privacy at its core is federated learning (FL) [23]. In the FL setting, a central entity (server) trains a model on user data without actually copying data from user devices. Instead, users (clients) update models locally, and the server aggregates these updates.
In spite of all its advantages, federated learning does not provide theoretical privacy guarantees of the kind offered by differential privacy (DP) [15], which is viewed by many researchers as the privacy gold standard. Initially, DP algorithms focused on sanitising simple statistics, such as the mean, median, etc., using a technique known as output perturbation. In recent years, the field made a lot of progress towards the goal of privacy-preserving machine learning, from works on objective perturbation [10] and stochastic gradient descent with DP updates [34] to more complex and practical methods [1, 29, 30, 24].
As shown in recent work [24, 18], the two approaches can be combined to provide joint benefits. However, unless the number of users is exceedingly high (e.g. in the scenario of a large population of mobile users considered in [24]), differentially private federated learning provides only weak guarantees. Contrary to a widespread opinion in the machine learning community, the values of ε commonly reported in this setting can hardly be seen as reassurance to a user: for certain types of attacks, an adversary can theoretically reach near-perfect accuracy.
We propose to augment federated learning with a natural relaxation of differential privacy, called Bayesian differential privacy (BDP) [35]
, that provides tighter, and thus, more meaningful guarantees. The main idea of this relaxation is based on the observation that machine learning tasks are often restricted to a particular type of data (for example, finding a film review in the MRI dataset is very unlikely). Moreover, this information, and potentially even some prior distribution of data, is often available to the attacker. While the traditional DP treats all data as equally likely and hides differences by large amounts of noise, BDP calibrates noise to the data distribution. Hence, for any two datasets drawn from the same (arbitrary) distribution, and given the same privacy mechanism with the same amount of noise, BDP provides tighter guarantees than DP. Note that the full knowledge of this distribution is not required, as the necessary statistics can be estimated from data.
We introduce the notion of Bayesian differential privacy in Section IV and extend it to the federated learning setting in Section V. Our experiments (see Section VI) show a significant advantage in both privacy guarantees and model quality.
The main contributions of this paper are the following:

we adapt the notion of Bayesian differential privacy to federated learning, including the more natural non-i.i.d. settings (Section V-A), to provide strong theoretical privacy guarantees under minor and practical assumptions;

we propose a novel joint accounting method for estimating client-level and instance-level privacy simultaneously and securely (Section V-C);

we experimentally demonstrate advantages of our method, such as shrinking the privacy budget to a fraction of the previous state of the art and noticeably improving the accuracy of the trained models (Section VI).
II Related Work
As machine learning applications become more and more common, various vulnerabilities and attacks on ML models get discovered, based on both passive (for example, model inversion [16] and membership inference [33]) and active adversaries (e.g. [21]), raising the need for developing matching defences.
Differential privacy [15, 12] is one of the strongest privacy standards that can be employed to protect ML models from these and other attacks. Since pure DP is hard to achieve in many realistic learning settings, a notion of approximate (ε, δ)-DP is used across the board in machine learning. It is often achieved by applying the Gaussian noise mechanism [13].
For a long time, however, even approximate DP remained unachievable in the more popular deep learning scenarios. Some earlier attempts [32] led to prohibitively high bounds on ε [1, 29] that were later shown to be ineffective against attacks [21]. A major step towards bringing privacy loss values down to more practical magnitudes was made in [1] with the introduction of the moments accountant, currently a state-of-the-art method for keeping track of the privacy loss during training. Followed by improvements in differentially private training techniques [29, 30], it allowed achieving single-digit DP guarantees (ε < 10) for classic supervised learning benchmarks, such as MNIST, SVHN, and CIFAR.
On the other end of the spectrum, McMahan et al. [23] proposed federated learning as one possible solution to privacy issues (among other problems, such as scalability and communication costs). In this setting, privacy is enforced by keeping data on user devices and only submitting model updates to the server. Two of the popular approaches are federated stochastic gradient descent (FedSGD) and federated averaging (FedAvg) [23], where clients do local on-device gradient descent using their data and then send these updates to the server, which applies the average update to the model. Privacy can further be enhanced by using secure multi-party computation (MPC) [39] to allow the server access only to averaged updates of a big group of users and not to individual ones. However, MPC or homomorphic encryption do not guarantee robustness against model inversion or membership inference [16, 33], because these attacks operate on the resulting model, which remains the same. Alternatively, by using differential privacy and the moments accountant, [24] and [18] attained theoretical client-level privacy guarantees for federated learning settings.
Since federated learning on its own lacks theoretical privacy guarantees, combining it with a more formal notion of privacy is an attractive direction of research. On the other hand, for more complicated deep learning models, differential privacy leads to a poor privacy-utility trade-off, suggesting that pairing federated learning with an alternative notion might prove beneficial.
Apart from differential privacy, a number of alternative definitions have been proposed over recent years, such as computational DP [26], mutual-information privacy [25, 38], different versions of concentrated DP (CDP [14], zCDP [7], tCDP [6]), and Rényi DP (RDP) [27]. Some other relaxations [2, 31, 9] tip the balance even further in favour of applicability at the cost of weaker guarantees, for example by considering the average case instead of the worst case [36].
In general, important aspects of a privacy notion are composability, accountability, and interpretability. Apart from sharp bounds, the moments accountant is attractive because it operates within the classic notion of DP. Some of the alternative notions of DP also provide tight composition theorems, along with some other advantages, but to the best of our knowledge, they are not broadly used in practice compared to traditional DP (although there are some examples [17]). One possible reason for that is interpretability: the parameters of RDP or CDP are hard to interpret. While it may be difficult to quantify the actual guarantee provided by specific values of ε and δ of the traditional DP, it is still advantageous that they have a clearer probabilistic interpretation.
In this work, we rely on another relaxation, called Bayesian differential privacy [35]. This notion boosts privacy accounting efficiency by utilising the fact that data come from a particular distribution, and that not all data are equally likely. At the same time, it maintains the probabilistic interpretation of its parameters ε and δ. It is worth noting that, unlike some of the relaxations mentioned above, the notion of Bayesian DP provides a worst-case guarantee (under specified conditions) and is not limited to a particular dataset, but rather to a particular type of data (e.g. emails, MRI images, etc.), or a mixture of such types, which is a much more permissive assumption.
III Preliminaries
In this section, we provide necessary definitions, background and notation used in the paper. We also describe the general setting of the problem.
III-A Definitions and notation
We use D and D′ to represent neighbouring (adjacent) datasets. If not specified, it is assumed that these datasets differ in a single example. Individual examples in a dataset are denoted by x or x_i, while the example by which the two datasets differ is denoted by x′. We assume D′ = D ∪ {x′} whenever it is possible to do so without loss of generality.
Since this paper mainly deals with adding noise to gradients (w.r.t. model parameters, neural network weights, etc.), we often refer to the non-noised gradients as the non-private outcome. The private learning outcomes are denoted by w. Whenever it is ambiguous, we denote the expectation over data (or, equivalently, gradients) as E_x, and over the learning outcomes as E_w. Finally, for federated learning scenarios, u_k indicates an update of a user k, while u denotes the set of all user updates.
Definition 1.
A randomised function (algorithm) A with domain 𝒟 and range ℛ satisfies (ε, δ)-differential privacy if for any two adjacent inputs D, D′ ∈ 𝒟 and for any set of outcomes S ⊂ ℛ the following holds:

Pr[A(D) ∈ S] ≤ e^ε Pr[A(D′) ∈ S] + δ.
Definition 2.
The privacy loss of a randomised algorithm A for an outcome w and datasets D, D′ is given by:

L(w, D, D′) = log ( p(w | D) / p(w | D′) ).
For notational simplicity, we omit the designation of the algorithm, i.e. we use L(w, D, D′) (or simply L) for the privacy loss random variable, and p(w | D), p(w | D′) for the outcome probability distributions given the corresponding datasets. Also, note that the privacy loss random variable L is distributed by drawing w ∼ p(w | D) (see [14, Section 2.1 and Definition 3.1]), which helps linking it to well-known divergences.
Definition 3.
The Gaussian noise mechanism achieving (ε, δ)-DP, for a function f : 𝒟 → ℝ^d, is defined as

M(D) = f(D) + 𝒩(0, σ²C²I),

where σ = √(2 log(1.25/δ)) / ε and C is the L2-sensitivity of f.
For more details on differential privacy and the Gaussian mechanism, we refer the reader to [13].
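As a concrete illustration, the Gaussian mechanism of Definition 3 can be sketched in a few lines of Python (function and parameter names are ours, not taken from any particular DP library):

```python
import numpy as np

def gaussian_mechanism(f_value, l2_sensitivity, sigma, rng=None):
    """Release f(D) + N(0, (sigma * C)^2 I), where C is the L2-sensitivity of f.

    `sigma` is the noise multiplier; larger sigma means stronger privacy.
    """
    rng = np.random.default_rng() if rng is None else rng
    f_value = np.asarray(f_value, dtype=float)
    noise = rng.normal(0.0, sigma * l2_sensitivity, size=f_value.shape)
    return f_value + noise

# Example: privatise the mean of values in [0, 1]; replacing one of n values
# changes the mean by at most 1/n, so the L2-sensitivity is 1/n.
x = np.array([0.2, 0.4, 0.9, 0.5])
private_mean = gaussian_mechanism(x.mean(), l2_sensitivity=1.0 / len(x), sigma=4.0)
```

Calibrating sigma to a target (ε, δ), e.g. σ = √(2 log(1.25/δ))/ε, follows the analysis in [13].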
We will also need the definition of Rényi divergence:
Definition 4.
Rényi divergence of order λ between distributions P and Q, denoted as D_λ(P ∥ Q), is defined as

D_λ(P ∥ Q) = (1 / (λ − 1)) log E_{x∼Q} [ (p(x) / q(x))^λ ],

where p(x) and q(x) are the corresponding density functions of P and Q.
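To build intuition for Definition 4, the divergence between two equal-variance Gaussians has the well-known closed form D_λ(𝒩(μ₀, s²) ∥ 𝒩(μ₁, s²)) = λ(μ₀ − μ₁)²/(2s²), which a direct numerical evaluation of the definition reproduces (a self-contained sketch; the names are ours):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def renyi_divergence_num(lmbda, p_pdf, q_pdf, lo=-50.0, hi=50.0):
    """D_lambda(P || Q) = 1/(lambda - 1) * log of the integral of p(x)^lambda q(x)^(1-lambda)."""
    integrand = lambda x: p_pdf(x) ** lmbda * q_pdf(x) ** (1.0 - lmbda)
    val, _ = quad(integrand, lo, hi)
    return np.log(val) / (lmbda - 1.0)

# Check the numerical evaluation against the closed form for equal variances.
mu0, mu1, s, lam = 0.0, 1.0, 2.0, 3.0
num = renyi_divergence_num(lam, norm(mu0, s).pdf, norm(mu1, s).pdf)
closed = lam * (mu0 - mu1) ** 2 / (2.0 * s ** 2)  # = 0.375
```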
III-B Setting
In the first part of the paper, while describing the concept of Bayesian differential privacy, we consider a general iterative learning algorithm, such that each iteration t produces a non-private learning outcome (e.g. a gradient over a batch of data). In the second half of the paper, we consider the equivalent federated learning setting, where each communication round t produces a set of non-private learning outcomes {u_k}, one for each participating client k.
The non-private outcome, in both scenarios, is transformed into a private learning outcome w_t that is used as a starting point for the next iteration or communication round. The learning outcome can be made private by different means, but in this work we consider the most common approach of applying an additive noise mechanism (e.g. a Gaussian noise mechanism). We denote the distribution of private outcomes by p(w_t | D) (we assume the Markov property of the learning process for brevity of notation, but it is not necessary in general) or p(w_t | u), depending on the scenario.
The process can run on subsamples of data or subsets of clients, in which case w_t comes from the distribution p(w_t | B_t), where B_t is the batch of data used for the parameter update in iteration t, or p(w_t | U_t), where U_t is the set of updates from users participating in communication round t. In these cases, privacy is amplified through sampling [4].
For each iteration, we would like to compute a quantity c_t(λ), which we call the privacy cost, that accumulates over the learning process and allows computing privacy loss bounds using concentration inequalities. The overall privacy accounting workflow does not significantly differ from prior work; it is in fact a generalisation of the well-known moments accountant [1]. Importantly, it is not tied to a specific learning algorithm or class of algorithms, as long as one can map the algorithm to the above setting.
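The workflow just described can be summarised in a minimal accountant sketch (names and structure are illustrative; the conversion to (ε, δ) anticipates Theorem 2 and Corollary 1 below):

```python
import math

class PrivacyAccountant:
    """Sketch of the accounting workflow: accumulate a per-iteration privacy
    cost c_t(lambda) and convert the total into an (epsilon, delta) pair."""

    def __init__(self, lmbda):
        self.lmbda = lmbda      # moment order, fixed in advance
        self.total_cost = 0.0   # running sum of c_t(lambda)

    def step(self, cost_t):
        # Privacy costs compose additively over iterations.
        self.total_cost += cost_t

    def get_epsilon(self, delta):
        # Chernoff-style conversion: eps = (sum_t c_t + log(1/delta)) / lambda.
        return (self.total_cost + math.log(1.0 / delta)) / self.lmbda
```

In practice one would keep several accountants for a grid of λ values and report the smallest resulting ε.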
III-C Motivation
Before we proceed, we find it important to motivate the research on, and usage of, alternative definitions of privacy. The primary reason is that the complexity of the concept of differential privacy often leads to misunderstanding or overestimation of the guarantees it provides. And while we do not fully tackle the problem of interpretability, we provide a simple example below that allows the reader to better judge the quality of the provided guarantees.
Consider the state-of-the-art differentially private machine learning models [1, 30]. In order to come close to the non-private accuracy (say within a few percent of it), all of the reported models stretch their privacy budget to single-digit values of ε (for a reasonably low δ), and in many cases beyond. In real-world applications, ε can grow even larger (see, e.g., https://www.wired.com/story/apple-differential-privacy-shortcomings/). These numbers seem small, and thus may often be overlooked. But let us present an alternative interpretation.
What we are interested in is the change in the attacker's posterior distribution after they see the private model, compared to their prior [27, 8]. Let us consider the stronger, pure ε-DP for simplicity. According to the definition of DP:

Pr[A(D) ∈ S] ≤ e^ε Pr[A(D′) ∈ S].
Assume the following specific example. The datasets D and D′ consist of income values for residents of a small town. There is one individual x′ whose income is orders of magnitude higher than the rest, and whose residency in the town is what the attacker wishes to infer. The attacker observes the mean income sanitised by an ε-differentially private mechanism. If the individual is not present in the dataset, the probability of the observed value being above a certain threshold is extremely small. On the contrary, if x′ is present, this probability is considerably higher. The attacker takes a Bayesian approach, computes the likelihood of the observed value under each of the two assumptions and the corresponding posteriors given a flat prior. The attacker then concludes that the individual is present in the dataset and is a resident.
By the above expression, the likelihood of the observation in the presence of x′ can only be e^ε times larger than the corresponding likelihood without x′. But if the latter is small enough, the probability of the attacker's guess being correct is as high as

P(correct guess) ≤ e^ε / (1 + e^ε) = 1 − 1 / (1 + e^ε).  (1)
To put it in perspective, already for a DP algorithm with ε = 3, the upper bound on the accuracy of this attack exceeds 95%, and it rapidly approaches 100% as ε grows further. Remember that we used an uninformative flat prior; for a more informed attacker, these numbers could be even larger.
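The bound of Eq. (1) is trivial to evaluate, which makes the weakness of large ε easy to demonstrate (a small sketch; the function name is ours):

```python
import math

def attack_accuracy_bound(epsilon):
    """Upper bound e^eps / (1 + e^eps) on the accuracy of the Bayesian
    membership attacker under a flat prior, as in Eq. (1)."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

# The bound quickly approaches 1 as the privacy budget grows.
for eps in (1.0, 3.0, 8.0):
    print(f"eps = {eps}: attack accuracy bound = {attack_accuracy_bound(eps):.4f}")
```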
In a more realistic scenario, even without any privacy protection, such high accuracy is not likely to be achieved by the attacker. Such a guarantee is therefore hardly better than no guarantee, and cannot be seen as reassuring. Thus, we want to encourage the discussion of, and search for, more meaningful privacy definitions or DP relaxations for machine learning and federated learning. We present and explore one such relaxation in this paper.
IV Bayesian Differential Privacy
In this section, we describe Bayesian differential privacy (BDP), accompanied by a practical privacy loss accounting method. We restate just the main results necessary for the following section, while all the details, proofs and experimental evaluation of BDP can be found in [35].
Definition 5.
A randomised function (algorithm) A with domain 𝒟 and range ℛ satisfies (ε, δ)-Bayesian differential privacy if for any two adjacent datasets D, D′, differing in a single data point x′ ∼ μ(x), and for any set of outcomes S ⊂ ℛ the following holds:

Pr[A(D) ∈ S] ≤ e^ε Pr[A(D′) ∈ S] + δ.  (2)
To derive tighter sequential composition, we use an alternative definition that implies the above:

P( L(w, D, D′) ≥ ε ) ≤ δ,  (3)

where the probability is taken over the randomness of the outcome w and the additional example x′.
This definition is very close to the original definition of DP, except that it also takes into account the randomness of x′. Hence, the basic properties are similar to those of DP, although BDP does not provide guarantees in all the scenarios where DP does (e.g. when the distribution is non-stationary, it is possible that BDP underestimates the actual privacy loss).
While Definition 5 does not specify the distribution of any point in the dataset other than the additional example x′, it is natural and convenient to assume that all examples in the dataset are drawn from the same distribution μ(x). This holds in many real-world applications, including all applications evaluated in this paper, and it allows using sampling techniques instead of requiring knowledge of the true distribution.
We also assume that all data points are exchangeable [3], i.e. any permutation of data points has the same joint probability. It enables tighter accounting for iterative mechanisms, and is naturally satisfied in the considered scenarios.
Since basic composition is not enough to provide tight privacy bounds for the iterative federated learning algorithms in which we are interested, let us present two theorems generalising the moments accountant routine. We use them as a foundation for the privacy accounting framework for federated learning presented in the next section.
Theorem 1 (Advanced Composition).
Let a learning algorithm run for T iterations. Denote by w_{1:T} = (w_1, …, w_T) a sequence of private learning outcomes obtained at iterations 1, …, T, and by L(w_{1:T}) the corresponding total privacy loss. Then,

E [ e^{λ L(w_{1:T})} ] ≤ ∏_{t=1}^{T} E_{x′} [ e^{λ D_{λ+1}(p_t ∥ q_t)} ],

where p_t = p(w_t | w_{1:t−1}, D), q_t = p(w_t | w_{1:t−1}, D′), and D_{λ+1}(p_t ∥ q_t) is the Rényi divergence of order λ + 1 between p_t and q_t.
We denote the logarithm of the quantity inside the product in Theorem 1 by c_t(λ) and call it the privacy cost of the iteration, or communication round, t:

c_t(λ) = log E_{x′} [ e^{λ D_{λ+1}(p_t ∥ q_t)} ].  (4)
The privacy cost of the whole learning process is then a sum of the costs of each iteration.
Theorem 2.
Let the algorithm produce a sequence of private learning outcomes w_{1:T} using a known probability distribution p(w_t | w_{1:t−1}, D). Then, ε and δ are related as

δ ≤ exp ( ∑_{t=1}^{T} c_t(λ) − λ ε ).
Corollary 1.
Under the conditions above, for a fixed δ:

ε ≤ (1/λ) ( ∑_{t=1}^{T} c_t(λ) + log(1/δ) ).
Theorem 2 provides an efficient privacy accounting algorithm. During training, we compute the privacy cost c_t(λ) for each iteration t, accumulate it, and then use Corollary 1 to compute the (ε, δ) pair. This process is ideologically close to that of the moments accountant, but accumulates a different quantity (note the change from the privacy loss random variable to the Rényi divergence and the expectation over data).
Computing c_t(λ) precisely requires access to the prior distribution of data μ(x), which is unrealistic. Therefore, we need an estimator for c_t(λ). Moreover, since the Chernoff bound, on which our Theorem 2 is based, only holds for the true expected value, we have to take into account the estimator error. To solve this, we employ a Bayesian view of the estimation problem [28] and use the upper confidence bound of the expectation estimator.
Let us define the following sample estimator of c_t(λ):

c̄_t(λ) = log ( m̂ + t_{1−γ, m−1} · ŝ / √m ),  (5)

where m̂ and ŝ are the sample mean and the sample standard deviation of e^{λ D_{λ+1}(p_t ∥ q_t)}, t_{1−γ, m−1} is the inverse of the Student's t-distribution CDF at 1 − γ with m − 1 degrees of freedom, and m is the number of samples of x′. One can show that, for continuous distributions p_t and q_t, c̄_t(λ) overestimates the true privacy cost with probability at least 1 − γ. Therefore, the probability of underestimation can be fixed upfront and incorporated in δ.
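A sketch of this upper-confidence-bound estimation, assuming the simple mean-plus-t-quantile form of estimator (5) (see [35] for the precise estimator and its correction terms):

```python
import numpy as np
from scipy.stats import t as student_t

def privacy_cost_ucb(exp_terms, gamma):
    """Estimate log E[X] from i.i.d. samples X_i = exp(lambda * D_i), taking
    the Student-t upper confidence bound of the mean so that the true cost is
    overestimated with probability at least 1 - gamma (simplified sketch)."""
    x = np.asarray(exp_terms, dtype=float)
    m = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    # Upper confidence bound on the expectation under a noninformative prior.
    ucb = mean + student_t.ppf(1.0 - gamma, df=m - 1) * sd / np.sqrt(m)
    return float(np.log(ucb))
```

Smaller γ (less tolerance for underestimation) pushes the quantile up and yields a more conservative cost.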
Remark 1.
This step changes the interpretation of δ in Bayesian differential privacy compared to the traditional DP. Apart from the probability of the privacy loss exceeding ε, e.g. in the tails of its distribution, it also incorporates our uncertainty about the true data distribution (in other words, the probability of underestimating the true expectation because of not observing enough data samples). It can be intuitively understood as accounting for unlikely or unobserved data in δ, rather than in ε by adding more noise.
Remark 2.
For discrete outcomes, a different estimator needs to be derived to meet the same bound. The process of obtaining it is identical to the one above, with the only change in the maximum entropy distribution.
Remark 3.
There are other differences compared to the classic DP, such as allowance for unbounded sensitivity or estimator privacy, that are not discussed in this work. More details can be found in [35].
Gaussian Noise Mechanism. Consider the subsampled Gaussian noise mechanism [13]. The outcome distribution in this case is the mixture of two Gaussians, (1 − q) 𝒩(g_t, σ²I) + q 𝒩(g_t′, σ²I), where g_t and g_t′ are the non-private outcomes at iteration t (e.g. gradients) computed without and with the additional example, σ is the noise parameter, and q is the data sampling probability. Plugging the outcome distribution into the formula for Rényi divergence, we get the following result for the privacy cost.
Theorem 3.
Given the Gaussian noise mechanism with the noise parameter σ and subsampling probability q, the privacy cost for integer λ at iteration t can be expressed as

c_t(λ) = log E_{x′} E_{k∼B(λ+1, q)} [ e^{(k² − k) Δ_t² / (2σ²)} ],

where Δ_t = ‖g_t − g_t′‖ and B(λ+1, q) is the binomial distribution with λ + 1 experiments and the probability of success q.
V Federated Learning with Bayesian Differential Privacy
In this section, we adapt the Bayesian differential privacy framework and its accounting method to guarantee client-level privacy, the level most frequently addressed in the literature. We then justify and explore instance-level privacy and two different techniques for accounting for it. Finally, we propose a method to jointly account for instance-level and client-level privacy in the FedSGD algorithm, in order to provide the strongest trade-off between utility and privacy guarantees.
V-A Client privacy
When it comes to reinforcing federated learning with differential privacy, the foremost attention is given to client-level privacy [24, 18]. The goal is to hide the presence of a single user or, to be more specific, to bound the influence of any single user on the learning outcome distribution (i.e. the distribution of the model parameters).
Under the classic DP [24, 18], privacy is enforced by clipping all user updates to a fixed norm threshold S and then adding Gaussian noise with the variance σ²S². The noise parameter σ is calibrated to bound the privacy loss in each communication round, and then the privacy loss is accumulated across the rounds using the moments accountant [1].
We use the same privacy mechanism, but employ the Bayesian accounting method instead of the moments accountant. Intuitively, our accounting method should have a significant advantage over the moments accountant in settings where data is distributed similarly across the users, because in this case their updates would be in strong agreement. In order to map the Bayesian differential privacy framework to this setting, let us introduce some notation.
Let N denote the number of clients in the federated learning system. Every client k computes and sends to the server a model update u_k drawn from the client's update distribution p_k(u). Considering individual client distributions ensures that our approach is applicable to the non-i.i.d. settings that are natural in the federated learning context. Generally, not all users participate in a given communication round. We denote the probability of a user participating in a round by q. Thus, the overall update distribution is given by the mixture:

p(u) = (1/N) ∑_{k=1}^{N} p_k(u).  (6)

In our experiments, the participation probability q is fixed.
To match the notation above, let ũ_t indicate the privacy-preserving model update:

ũ_t = ∑_{k ∈ U_t} ū_k + ξ,  (7)

where ū_k is the update of client k clipped to the norm threshold S, ξ ∼ 𝒩(0, σ²S²I) in the case of the Gaussian mechanism, and U_t is the set of updates from users participating in the round t.
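A sketch of this server-side mechanism (Eq. 7): clip each participating client's update to norm S, sum, add Gaussian noise, and average (a common server-side choice; the names are ours, not the authors' implementation):

```python
import numpy as np

def aggregate_with_client_privacy(updates, clip_norm, sigma, rng):
    """Clip each client update to L2 norm `clip_norm` (S), sum the clipped
    updates, add N(0, (sigma * S)^2 I) noise, and average over clients."""
    total = np.zeros_like(updates[0], dtype=float)
    for u in updates:
        # Scale down any update whose norm exceeds the clipping threshold.
        total += u / max(1.0, np.linalg.norm(u) / clip_norm)
    total += rng.normal(0.0, sigma * clip_norm, size=total.shape)
    return total / len(updates)
```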
To bound ε and δ of Bayesian differential privacy, one needs to compute c_t(λ) as the maximum of the two direction-dependent costs,

log E_{u′} [ e^{λ D_{λ+1}( p(ũ_t | u) ∥ p(ũ_t | u′) )} ]

and

log E_{u′} [ e^{λ D_{λ+1}( p(ũ_t | u′) ∥ p(ũ_t | u) )} ],

where u and u′ are the sets of updates with and without the client in question. Since the randomness of ũ_t comes from the subsampled Gaussian noise mechanism, we use Theorem 3, in combination with user sampling and the estimator (5) for both expressions, to obtain a quantity that upper-bounds c_t(λ) with high probability.
V-B Instance privacy
As noted above, the same privacy mechanism can be used in conjunction with the moments accountant to get the classic DP guarantees [24, 18]. In this case, DP at the client level implies the same guarantee at the instance level (i.e. bounding the influence of a single data point). However, this does not hold for BDP. Moreover, the same privacy guarantee may not be meaningful at the instance level. For example, a δ on the order of the inverse of the number of clients might be reasonable at the client level, but if a client holds tens of thousands of data points, it is not a reasonable failure probability at the data point level.
At the same time, instance privacy is extremely important in some scenarios. Imagine federated training on medical data from different hospitals: while a hospital's participation may be public knowledge, individual patients' data must be protected to the highest degree. Another reason for considering instance-level privacy is that it provides an additional layer of protection for users in the case of an untrusted curator.
In order to get tighter instance privacy guarantees, we apply the subsampled Gaussian noise mechanism to the gradient computation on user devices. The accounting follows the same procedure as described above, except that the noise parameter σ and the sampling probability may differ, depending on which of the settings described below is used.
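One on-device step of this subsampled Gaussian mechanism might look as follows (an illustrative sketch with our own names, not the exact training code):

```python
import numpy as np

def private_local_gradient(per_example_grads, clip_norm, sigma, sample_prob, rng):
    """Poisson-subsample examples with probability `sample_prob`, clip each
    selected per-example gradient to `clip_norm` in L2 norm, sum, and add
    Gaussian noise with standard deviation sigma * clip_norm."""
    mask = rng.random(len(per_example_grads)) < sample_prob
    selected = [g for g, keep in zip(per_example_grads, mask) if keep]
    total = np.zeros_like(per_example_grads[0], dtype=float)
    for g in selected:
        total += g / max(1.0, np.linalg.norm(g) / clip_norm)  # clip to C
    total += rng.normal(0.0, sigma * clip_norm, size=total.shape)
    return total / max(1, len(selected))
```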
There are two possible accounting schemes:
V-B1 Sequential accounting
Part of the accounting is performed locally on user devices and part on the server. The overall privacy cost is equivalent to that of centralised training with the data sampling probability q = L/n, where n is the total number of data points across all users and L is the local batch size.
The process proceeds as follows. At each communication round, the server sends the current model to participating clients; every client performs private gradient updates, computes c_t(λ), and sends it to the server. The server then aggregates the sum of c_t(λ) over all users. Since the privacy costs are data-dependent, it is possible to use secure multi-party computation to let the server learn the sum without learning the individual costs.
The disadvantage of this method is that every participating client learns the total number of data points, which may not be desirable, especially in settings with a small number of users. Furthermore, the obtained bounds apply to the commonly learnt model but not to the individual updates of each user, requiring them to maintain a separate local bound. These issues are addressed by parallel accounting.
V-B2 Parallel accounting
In this scheme, every client k computes c_t(λ) using its local sampling probability q_k = L/n_k, where n_k is the local dataset size of the client. Consequently, since q_k > q, the privacy costs will be higher. But this is compensated by using parallel composition instead of sequential: the server aggregates the maximum of c_t(λ) over all users. Again, secure multi-party computation can be used to prevent the server from learning individual privacy costs.
Parallel composition is applicable in this scenario because user updates within a single round are independent. However, the server needs to sum up maximum privacy costs over the rounds because updates are dependent on previous rounds.
As we show in Section VI, parallel accounting may require more communication rounds to converge to a solution of the same quality with the same privacy guarantee. The gap is more notable on non-identically distributed data.
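The two accounting schemes differ only in how per-client costs are combined within a round; both then sum over rounds (a hypothetical sketch with our own names):

```python
def compose_sequential(round_costs):
    """Sequential accounting: sum per-client costs within each round
    (each computed with the global sampling probability), then sum over rounds."""
    return sum(sum(costs) for costs in round_costs)

def compose_parallel(round_costs):
    """Parallel accounting: take the max per-client cost within each round
    (each computed with the client's local sampling probability), then sum
    over rounds, since updates depend on previous rounds."""
    return sum(max(costs) for costs in round_costs)

# round_costs[t][k] is the privacy cost reported by client k in round t.
```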
V-C Joint privacy
Instance privacy provides tighter and more meaningful guarantees for every data point's contribution to the trained model. Nevertheless, there is a downside: adding noise both during the on-device gradient descent and during the averaging phase on the server results in slow convergence, or complete divergence, of the federated learning algorithm.
To tackle this problem, we propose joint accounting, where the noise added on the client side is recounted towards the clientlevel privacy guarantee.
The main idea of joint accounting is that a client update received by the server is already noisy when instance privacy is enforced, and instead of adding more noise the server can recount the existing noise towards the client-level bound (a version of this technique for differential privacy has also been explored in a master's thesis project with Nikolaos Tatarakis). The only problem is that the server cannot sample non-private client updates to estimate c_t(λ), because it no longer has access to their distribution.
Fortunately, the inner expectations in the privacy cost can be computed locally, suggesting the following procedure. Every client computes the estimator (Eq. 5) with p and q being the private outcome distributions with and without its entire update. Then, the server obtains the overall sample statistics, and thus c̄_t(λ), by simple averaging. Additionally, one can implement this averaging step with secure multi-party computation to further strengthen privacy protection. For the moment, however, this method can only be used with FedSGD, and not FedAvg, because every noisy step in FedAvg would change the point at which the gradient is computed, potentially leading to a different gradient distribution or an underestimated total noise variance.
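Ignoring the confidence correction of estimator (5) for brevity, the joint client-level accounting step can be sketched as follows (names are ours; in deployment the average would be computed with secure MPC):

```python
import numpy as np

def joint_client_level_cost(local_exp_terms):
    """Each client locally evaluates the inner expectation E_w[exp(lambda * D)]
    between the private outcome distributions with and without its own update;
    the server takes the log of the average over clients (the outer
    expectation), yielding the client-level privacy cost for the round."""
    return float(np.log(np.mean(local_exp_terms)))
```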
Using joint accounting allows achieving tight instance and client privacy guarantees while preserving the speed of convergence almost at the level of the client-privacy-only solution (see Section VI-D).
VI Evaluation
In this section, we provide the results of the experimental evaluation of our approach. We begin by describing the datasets we used, as well as the settings shared by all experiments. The subsequent structure follows that of the previous section. We first evaluate client-level privacy by comparing accuracy and privacy guarantees of the traditional DP method [18] to ours (Section VI-B). Then, in Section VI-C, we perform experiments on the two proposed methods of instance privacy accounting. Finally, Section VI-D describes the results of the joint accounting approach.
VI-A Experimental setup
We perform experiments on two datasets. The first dataset is MNIST [22], a standard image classification task widely used in machine learning research. More specifically, it is a handwritten digit recognition dataset consisting of 60000 training examples and 10000 test examples, each a 28×28 greyscale image. The second is the APTOS 2019 Blindness Detection challenge dataset (https://www.kaggle.com/c/aptos2019-blindness-detection/overview/description); in figures, tables and text, we refer to this dataset as Retina or APTOS. It consists of 3662 retina images taken using fundus photography. The images are labelled by clinicians to reflect the severity of diabetic retinopathy on a scale from 0 to 4. Unlike other datasets commonly evaluated in the privacy literature [1, 24, 18], this one has much more serious implications in the event of a privacy leak.
All experiments have the following general setup. There is a number of clients (100, 1000, or 10000), each holding a subset of data, and a server that coordinates federated training of the shared model. Some setups with a higher number of users entail repetition of data, as in [18], which is a natural scenario in some applications, e.g. shared or very similar images on different smartphones. In the MNIST experiments, each user holds 600 examples. For the APTOS dataset, we use data augmentation techniques (e.g. random cropping, resizing, etc.) to obtain a larger training set, and then split it such that every client gets 350 images.
Testing is performed on the official test split for MNIST, and on the first 500 samples in case of APTOS.
While the parameters of training vary between experiments, the models remain the same. For MNIST, we use a simple CNN with two convolutional layers and two fully connected layers (similar to the one described in the TensorFlow tutorial, https://www.tensorflow.org/tutorials/images/deep_cnn). In the case of APTOS, due to the small dataset size and a harder learning task, we employ ResNet-50 [20] pretrained on ImageNet [11] and retrain only the last fully connected layer of the network. We do not do extensive hyperparameter tuning in general, since we are interested in the relative performance of private models compared to non-private ones rather than the best classification accuracy, and thus our non-private baseline results may not always match the ones reported in [23]. For the same reason, we restrict the number of communication rounds and use FedSGD instead of FedAvg, although all the methods except for joint accounting are compatible with FedAvg.
One of the important aspects of federated learning is that data might not be distributed identically among users. In agreement with previous work [23, 18], we include experiments in both i.i.d. and non-i.i.d. settings for MNIST, because it allows for a natural non-identical split. More specifically, in the i.i.d. setting, every user is assigned a subset of uniformly sampled examples. In the non-i.i.d. setting, we follow the same scheme as [23] and [18]: splitting the dataset into shards of 300 points within the same class and then assigning 2 random shards to each client. The scenario of 100 clients with non-identically distributed data is particularly hard for privacy applications: there are few possible digit combinations that clients can hold and only 100 clients, meaning that some clients might be easily distinguishable by their data distribution. Therefore, it is important to note that it may not be possible to obtain a reasonable privacy bound in this scenario without seriously compromising accuracy.
The privacy accounting is performed by two methods. To obtain the bounds on ε and δ of differential privacy, we use the moments accountant [1], the state-of-the-art DP accounting method. In the case of Bayesian differential privacy, we follow the technique described in Sections IV and V: we sample a number of user updates (or gradients, for instance privacy), estimate an upper bound on the privacy cost, and use the Chernoff inequality to compute the corresponding pair of (ε, δ).
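Both accountants ultimately convert accumulated log moments of the privacy loss into an (ε, δ) pair. A minimal sketch of this conversion follows, using the tail bound of the moments accountant [1]; the Gaussian log-moment formula and all numeric values below are illustrative assumptions, not the exact configuration used in our experiments:

```python
import numpy as np

def eps_from_log_moments(log_moments, delta):
    """Chernoff/Markov tail bound of the moments accountant [1]:
    eps = min over lambda of (alpha(lambda) + log(1/delta)) / lambda."""
    return min((alpha + np.log(1.0 / delta)) / lam for lam, alpha in log_moments)

def gaussian_log_moment(lam, sigma, rounds):
    # Log moment of a (non-subsampled) Gaussian mechanism with unit
    # sensitivity, composed additively over `rounds` applications.
    return rounds * lam * (lam + 1) / (2.0 * sigma ** 2)

# Illustrative example: 100 rounds of Gaussian noise with sigma = 8, delta = 1e-5.
log_moments = [(lam, gaussian_log_moment(lam, sigma=8.0, rounds=100))
               for lam in range(1, 33)]
eps = eps_from_log_moments(log_moments, delta=1e-5)
```

The same tail-bound step applies to the Bayesian accountant, except that the log moments are estimated from sampled user updates rather than computed in closed form.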
VI-B Client privacy
[Table: client-level privacy on MNIST (non-i.i.d.). Columns: Clients (100, 1K, 10K), Baseline accuracy, DP accuracy, BDP accuracy, DP privacy, BDP privacy; cell values not recovered.]
[Table: client-level privacy on MNIST (i.i.d.). Columns: Clients (100, 1K, 10K), Baseline accuracy, DP accuracy, BDP accuracy, DP privacy, BDP privacy; cell values not recovered.]
In this experiment, we test adding client privacy in the same way as it is done in [24] and [18]. We use Bayesian accounting, as described in Section V-A, and compare it to classic differential privacy accounting by the moments accountant [1]. We fix the noise level and account for DP and BDP in parallel.
The tables summarise the accuracy and privacy guarantees obtained in this setting for MNIST (non-i.i.d. and i.i.d.) and APTOS, respectively. The first column indicates the number of clients; the second, the baseline accuracy of a non-private federated classifier (models described in the previous section). The following columns contain the accuracy and privacy parameters obtained for private models using classic DP and BDP. Despite being trained in parallel, the two techniques may differ in accuracy, because in some cases we apply early stopping for DP to prevent exceeding the privacy budget.
[Table: client-level privacy on APTOS. Columns: Clients (100, 1K, 10K), Baseline accuracy, DP accuracy, BDP accuracy, DP privacy, BDP privacy; cell values not recovered.]
In all cases and for all datasets, we observe substantial benefits of using Bayesian accounting. The accuracy gains are most notable in the non-i.i.d. setting of MNIST with 100 clients, which presents a more difficult learning scenario, as explained in the previous section. The privacy gains are consistently significant across all datasets and settings; taking into account that the privacy budget ε is exponentiated to obtain the bound on outcome probability ratios, BDP can reach a substantially stronger guarantee. Nevertheless, in the settings with few clients, even Bayesian differential privacy does not reach a more comfortable guarantee, suggesting either that a better privacy-accuracy trade-off may not be feasible due to higher client identifiability, or that more work is needed on improving training with noise and developing novel privacy mechanisms for federated learning.
Importantly, there is no computation or communication overhead from the users’ point of view in these experiments since the privacy accounting code is executed on the server.
VI-C Instance privacy
[Table: accuracy (Baseline, DP, BDP) and privacy budgets (Client, Instance) for APTOS 2019, MNIST (i.i.d.), and MNIST (non-i.i.d.); cell values not recovered.]
As noted in Section V-B, instance privacy is very important in scenarios like collecting medical data from a number of hospitals, where patient privacy is at least as crucial as hospital privacy. In this section, we compare the two accounting methods proposed earlier: sequential and parallel accounting.
Depicted in Figure 1 are the curves showing the growth of the privacy cost estimate with communication rounds. We subtract the initial value and apply a logarithmic scale in order to better show the difference in the rates of growth. Across all settings, parallel accounting leads to faster growth rates, despite the fact that parallel composition itself is more efficient (taking the maximum over clients instead of the sum). This behaviour can be explained by the fact that each client is unaware of the total dataset size and, holding a small number of data points, is convinced that every data example has a significant influence on the outcome. This unawareness of other clients in the case of parallel accounting can also explain why we do not observe any improvement in the growth rate when increasing the number of clients. The only exception is the non-i.i.d. MNIST experiment, where the difference is likely to come from the increased stability of training and decreased gradient variability with more clients.
The main takeaway from this experiment is that it is beneficial to use sequential accounting for privacy of the federated model whenever communicating the total size of the dataset to users is acceptable. In other cases, and for personal privacy accounting in the case of an untrusted curator, parallel accounting can be used, but more noise is necessary to obtain reasonable privacy guarantees.
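A minimal sketch of the two composition rules compared above, assuming per-round, per-client privacy costs have already been estimated (the cost values in the example are hypothetical):

```python
def compose_costs(per_round_client_costs, mode="sequential"):
    """Aggregate per-client privacy costs within each round, then compose
    additively over rounds. Sequential accounting sums over clients;
    parallel accounting (disjoint client data) takes the maximum instead."""
    within_round = sum if mode == "sequential" else max
    return sum(within_round(costs) for costs in per_round_client_costs)

# Two rounds, three clients, with hypothetical per-round costs.
costs = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.1]]
sequential_total = compose_costs(costs, mode="sequential")  # sums within rounds
parallel_total = compose_costs(costs, mode="parallel")      # maxima within rounds
```

As the experiment shows, the per-step advantage of taking the maximum does not guarantee slower growth overall, because each client's locally estimated costs are larger when it only sees a small portion of the data.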
VI-D Joint privacy
Lastly, we test the proposed method of joint accounting of instance-level and client-level privacy and contrast it with accounting at these two levels separately. We perform experiments in the same settings as above, fixing the client privacy at a certain level () and evaluating the speed and quality of training. We also compare with what can be achieved by introducing privacy only at the client level.
Figure 2 displays the evolution of test accuracy over communication rounds in the setup of 100 clients for APTOS and 1000 clients for MNIST. The graphs contain curves for training without privacy, with client-level privacy only, and with the two accounting paradigms: joint and split. As expected, non-private training achieves the best accuracy. Nevertheless, in the i.i.d. setting, client-only private training quickly approaches non-private training in quality. Notably, training with both instance and client privacy using joint accounting performs nearly as well, while training with separate accounting fails completely due to excessive amounts of noise at both the instance and client levels. In the non-i.i.d. setting, private training is slower, but there is little difference between introducing privacy only at the client level and using joint accounting at both levels: after a slightly larger number of rounds, training with joint accounting reaches similar performance. Based on these experiments, we conclude that by using joint accounting we can introduce instance privacy on clients and obtain client-level privacy at almost no cost.
Finally, we evaluate our method in the strong privacy setting. We fix the instance-level privacy budget and stop training when the client privacy reaches a level similar to that of the previous experiments (except for the APTOS dataset, where we were able to achieve comparable results at a lower privacy cost), and report the accuracy that can be achieved in this strict setting. We also choose the most difficult scenario of 100 clients. As seen in Table IV, the algorithm with differential privacy performs very poorly on the APTOS dataset and fails to learn on MNIST, in both the i.i.d. and non-i.i.d. settings. Its performance is especially affected by the strict instance privacy requirement, since such low levels of ε necessitate large quantities of added noise. It is worth noting that it might be possible to achieve better results with DP by performing per-example gradient clipping, as in [1], but we do not use this technique due to its impracticality. On the other hand, our approach manages to achieve reasonable accuracy even under such a strong privacy guarantee. On the APTOS dataset, accuracy is only slightly lower than the non-private baseline, while on MNIST, the model correctly classifies the majority of the test data in both the i.i.d. and non-i.i.d. settings. One could potentially add more noise on the server and combine that accounting with the instance-level noise to slow down the growth of ε and reach even better performance, but we leave these experiments for future work.
Both instance and joint privacy accounting add some computation overhead on user devices due to multiple gradient calculations. However, performing FL routines when devices are idle and charging, as suggested in [5], alleviates this problem. The communication overhead is negligible, because only a single floating-point number is added to user messages.
VII Conclusion
We employ the notion of Bayesian differential privacy, a relaxation of differential privacy, to obtain tighter privacy guarantees for clients in the federated learning setting. The main idea of this approach is to utilise the fact that users come from a certain population with similarly distributed data, and therefore their updates are likely to be in agreement with each other. This is a meaningful assumption in many machine learning scenarios, because they target a specific type of data (e.g. medical images, emails, motion sensor data, etc.). For example, it may be unjustified to try to hide the absence of an audio record in a training set for ECG analysis, since the probability of such a record appearing there is in fact much smaller than differential privacy assumes.
We adapt an efficient and tight privacy accounting method for Bayesian differential privacy to the federated setting in order to estimate client privacy guarantees. Moreover, we emphasise the importance of instance-level privacy and propose two variants of privacy accounting at this level. Finally, we introduce a novel technique of joint accounting, suitable for obtaining privacy guarantees at the instance and client levels jointly from instance-level noise alone.
Our evaluation provides evidence that Bayesian differential privacy is more appropriate for federated learning. First, it requires significantly less noise to reach the same privacy guarantees, allowing models to train in fewer communication rounds. Second, the bounds on the privacy budget are much tighter and thus more meaningful. When the number of clients reaches the order of thousands, which is realistic in many federated learning scenarios, ε can be kept low. Finally, we demonstrate that by using joint accounting we can obtain client privacy for free when adding instance privacy. This way, the privacy budget can be kept small at both the client and instance levels while maintaining reasonably high accuracy.
An important future direction of research is automatically detecting and mitigating scenarios in which tighter privacy guarantees are inapplicable, such as non-stationary data distributions or datasets with non-exchangeable samples.
References
 [1] (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. Cited by: §I, §II, §IIIB, §IIIC, §VA, §VIA, §VIA, §VIB, §VID.

 [2] (2013) Differential privacy applications to Bayesian and linear mixed model estimation. Journal of Privacy and Confidentiality 5 (1), pp. 4. Cited by: §II.
 [3] (1985) Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII–1983, pp. 1–198. Cited by: §IV.
 [4] (2018) Privacy amplification by subsampling: tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems, pp. 6280–6290. Cited by: §IIIB.
 [5] (2019) Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046. Cited by: §VID.

 [6] (2018) Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 74–86. Cited by: §II.
 [7] (2016) Concentrated differential privacy: simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635–658. Cited by: §II.
 [8] (2017) A teaser for differential privacy. Cited by: §IIIC.
 [9] (2017) On the meaning and limits of empirical differential privacy. Journal of Privacy and Confidentiality 7 (3), pp. 3. Cited by: §II.
 [10] (2011) Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (Mar), pp. 1069–1109. Cited by: §I.

 [11] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §VIA.
 [12] (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284. Cited by: §II.
 [13] (2014) The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9 (3–4), pp. 211–407. Cited by: §II, §IIIA, §IV.
 [14] (2016) Concentrated differential privacy. arXiv preprint arXiv:1603.01887. Cited by: §II, §IIIA.
 [15] (200607) Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), Vol. 4052, Venice, Italy, pp. 1–12. External Links: Link, ISBN 3540359079 Cited by: §I, §II.
 [16] (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333. Cited by: §I, §II, §II.
 [17] (2017) Renyi differential privacy mechanisms for posterior sampling. In Advances in Neural Information Processing Systems, pp. 5289–5298. Cited by: §II.
 [18] (2017) Differentially private federated learning: a client level perspective. arXiv preprint arXiv:1712.07557. Cited by: §I, §II, §VA, §VA, §VB, §VIA, §VIA, §VIA, §VIB, §VI.
 [19] (2013) Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences 249, pp. 124–131. Cited by: §IIIA.
 [20] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §VIA.
 [21] (2017) Deep models under the gan: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 603–618. Cited by: §II, §II.
 [22] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §VIA.
 [23] (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629. Cited by: §I, §II, §VIA, §VIA.
 [24] (2017) Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963. Cited by: §I, §I, §II, §VA, §VA, §VB, §VIA, §VIB.
 [25] (2012) Information-theoretic foundations of differential privacy. In International Symposium on Foundations and Practice of Security, pp. 374–381. Cited by: §II.
 [26] (2009) Computational differential privacy. In Annual International Cryptology Conference, pp. 126–142. Cited by: §II.
 [27] (2017) Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th, pp. 263–275. Cited by: §II, §IIIC.
 [28] (2006) A bayesian perspective on estimating mean, variance, and standarddeviation from data. Cited by: §IV.
 [29] (2016) Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755. Cited by: §I, §II.
 [30] (2018) Scalable private learning with pate. arXiv preprint arXiv:1802.08908. Cited by: §I, §II, §IIIC.
 [31] (2015) A new method for protecting interrelated time series with bayesian prior distributions and synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society) 178 (4), pp. 963–975. Cited by: §II.
 [32] (2015) Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321. Cited by: §II.
 [33] (2017) Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 3–18. Cited by: §I, §II, §II.
 [34] (2013) Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pp. 245–248. Cited by: §I.
 [35] (2019) Bayesian differential privacy for machine learning. arXiv preprint arXiv:1901.09697. Cited by: §I, §II, §IV, Remark 3.
 [36] (2019) Generating artificial data for private deep learning. In Proceedings of the PAL: Privacy-Enhancing Artificial Intelligence and Language Technologies, AAAI Spring Symposium Series, CEUR Workshop Proceedings, pp. 33–40. External Links: ISSN 1613-0073, Link Cited by: §II.

 [37] (2014) Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60 (7), pp. 3797–3820. Cited by: §IIIA.
 [38] (2016) On the relation between identifiability, differential privacy, and mutual-information privacy. IEEE Transactions on Information Theory 62 (9), pp. 5018–5029. Cited by: §II.
 [39] (1982) Protocols for secure computations. In FOCS, Vol. 82, pp. 160–164. Cited by: §II.