The rise of data analytics and machine learning (ML) presents countless opportunities for companies, governments and individuals to benefit from the accumulated data. At the same time, the ability of these methods to capture fine levels of detail potentially compromises the privacy of data providers. Recent research [16, 33] suggests that even in a black-box setting it is possible to infer the presence of individual records in the training set or to recover certain features of these records.
To tackle this problem, a number of solutions have been proposed. They vary in how privacy is achieved and to what extent data is protected. One approach that assumes privacy at its core is federated learning (FL). In the FL setting, a central entity (server) trains a model on user data without actually copying data from user devices. Instead, users (clients) update models locally, and the server aggregates these updates.
In spite of all its advantages, federated learning does not provide theoretical privacy guarantees of the kind offered by differential privacy (DP), which is viewed by many researchers as the privacy gold standard. Initially, DP algorithms focused on sanitising simple statistics, such as the mean or the median, using a technique known as output perturbation. In recent years, the field has made a lot of progress towards the goal of privacy-preserving machine learning, from works on objective perturbation and stochastic gradient descent with DP updates to more complex and practical methods [1, 29, 30, 24].
As shown in recent work [24, 18], the two approaches can be combined to provide joint benefits. However, unless the number of users is exceedingly high (e.g. in the scenario of a large population of mobile users), differentially private federated learning provides only weak guarantees. Contrary to a widespread opinion in the machine learning community, values of $\varepsilon$ close to can hardly be seen as reassurance to a user: for certain types of attacks, an adversary can theoretically reach accuracy of .
We propose to augment federated learning with a natural relaxation of differential privacy, called Bayesian differential privacy (BDP), which provides tighter, and thus more meaningful, guarantees. The main idea of this relaxation is based on the observation that machine learning tasks are often restricted to a particular type of data (for example, finding a film review in an MRI dataset is very unlikely). Moreover, this information, and potentially even some prior distribution of data, is often available to the attacker. While the traditional DP treats all data as equally likely and hides differences by large amounts of noise, BDP calibrates noise to the data distribution. Hence, for any two datasets drawn from the same (arbitrary) distribution, and given the same privacy mechanism with the same amount of noise, BDP provides tighter guarantees than DP. Note that full knowledge of this distribution is not required, as the necessary statistics can be estimated from data.
We introduce the notion of Bayesian differential privacy in Section IV and extend it to the federated learning setting in Section V. Our experiments (see Section VI) show a significant advantage in both privacy guarantees and model quality.
The main contributions of this paper are the following:
we adapt the notion of Bayesian differential privacy to federated learning, including more natural non-i.i.d. settings (Section V-A), to provide strong theoretical privacy guarantees under minor and practical assumptions;
we propose a novel joint accounting method for estimating client-level and instance-level privacy simultaneously and securely (Section V-C);
we experimentally demonstrate advantages of our method, such as shrinking the privacy budget to a fraction of the previous state-of-the-art, and improving the accuracy of the trained models by up to (Section VI).
II Related Work
As machine learning applications become more and more common, various vulnerabilities and attacks on ML models are being discovered, based on both passive adversaries (for example, model inversion and membership inference) and active ones, raising the need for matching defences.
Differential privacy [15, 12] is one of the strongest privacy standards that can be employed to protect ML models from these and other attacks. Since pure $\varepsilon$-DP is hard to achieve in many realistic learning settings, the notion of approximate $(\varepsilon, \delta)$-DP is used across the board in machine learning. It is often achieved by applying the Gaussian noise mechanism.
For a long time, however, even approximate DP remained unachievable in more popular deep learning scenarios. Some earlier attempts led to prohibitively high bounds on $\varepsilon$ [1, 29] that were later shown to be ineffective against attacks. A major step towards bringing privacy loss values down to more practical magnitudes was the introduction of the moments accountant, currently a state-of-the-art method for keeping track of the privacy loss during training. Followed by improvements in differentially private training techniques [29, 30], it made it possible to achieve single-digit DP guarantees ($\varepsilon < 10$) for classic supervised learning benchmarks, such as MNIST, SVHN, and CIFAR.
On the other end of the spectrum, McMahan et al. proposed federated learning as one possible solution to privacy issues (among other problems, such as scalability and communication costs). In this setting, privacy is enforced by keeping data on user devices and only submitting model updates to the server. Two popular approaches are federated stochastic gradient descent (FedSGD) and federated averaging (FedAvg), where clients perform local on-device gradient descent using their data and send the resulting updates to the server, which applies an average update to the model. Privacy can be further enhanced by using secure multi-party computation (MPC) to give the server access only to the average updates of a large group of users and not to individual ones. However, neither MPC nor homomorphic encryption guarantees robustness against model inversion or membership inference [16, 33], because these attacks operate on the resulting model, which remains the same. Alternatively, by using differential privacy and the moments accountant, theoretical client-level privacy guarantees have been attained for federated learning settings.
Since federated learning on its own lacks theoretical privacy guarantees, combining it with a more formal notion of privacy is an attractive direction of research. On the other hand, for more complicated deep learning models, differential privacy leads to a poor privacy-utility trade-off, suggesting that pairing federated learning with an alternative notion might prove beneficial.
Apart from differential privacy, a number of alternative definitions have been proposed in recent years, such as computational DP, mutual-information privacy [25, 38], different versions of concentrated DP (CDP, zCDP, tCDP), and Rényi DP (RDP). Some other relaxations [2, 31, 9] tip the balance even further in favour of applicability at the cost of weaker guarantees, for example by considering the average case instead of the worst case.
In general, important aspects of a privacy notion are composability, accountability, and interpretability. Apart from sharp bounds, the moments accountant is attractive because it operates within the classic notion of $(\varepsilon, \delta)$-DP. Some of the alternative notions of DP also provide tight composition theorems, along with some other advantages, but to the best of our knowledge, they are not broadly used in practice compared to traditional DP (although there are some examples). One of the possible reasons for that is interpretability: parameters of $(\alpha, \varepsilon)$-RDP or $(\mu, \tau)$-CDP are hard to interpret. While it may be difficult to quantify the actual guarantee provided by specific values of $\varepsilon$ and $\delta$ of the traditional DP, it is still advantageous that they have a clearer probabilistic interpretation.
In this work, we rely on another relaxation, called Bayesian differential privacy. This notion boosts privacy accounting efficiency by utilising the fact that data come from a particular distribution, and not all data are equally likely. At the same time, it maintains the probabilistic interpretation of its parameters $\varepsilon$ and $\delta$. It is worth noting that, unlike some of the relaxations mentioned above, the notion of Bayesian DP provides a worst-case guarantee (under specified conditions) and is not limited to a particular dataset, but rather to a particular type of data (e.g. emails, MRI images, etc.), or a mixture of such types, which is a much more permissive assumption.
In this section, we provide necessary definitions, background and notation used in the paper. We also describe the general setting of the problem.
III-A Definitions and notation
We use $D, D'$ to represent neighbouring (adjacent) datasets. If not specified, it is assumed that these datasets differ in a single example. Individual examples in a dataset are denoted by $x$ or $x_i$, while the example by which two datasets differ is denoted by $x'$. We assume $D' = D \cup \{x'\}$, whenever possible to do so without loss of generality.
Since this paper mainly deals with adding noise to gradients (w.r.t. model parameters, neural network weights, etc.), we often refer to non-noised gradients as the non-private outcome, and denote it by $g$. The private learning outcomes are denoted by $w$. Whenever it would otherwise be ambiguous, we denote expectation over data (or equivalently, gradients $g$) as $\mathbb{E}_x$, and over the learning outcomes $w$ as $\mathbb{E}_w$. Finally, for federated learning scenarios, $u_i$ indicates an update of a user $i$, while $u$ is the set of all user updates.
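A randomised function (algorithm) $\mathcal{A}: \mathcal{D} \to \mathcal{R}$ with domain $\mathcal{D}$ and range $\mathcal{R}$ satisfies $(\varepsilon, \delta)$-differential privacy if for any two adjacent inputs $D, D' \in \mathcal{D}$ and for any set of outcomes $\mathcal{S} \subseteq \mathcal{R}$ the following holds:
$$\Pr[\mathcal{A}(D) \in \mathcal{S}] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in \mathcal{S}] + \delta.$$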
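The privacy loss of a randomised algorithm $\mathcal{A}$ for outcome $w \in \mathcal{R}$ and datasets $D, D' \in \mathcal{D}$ is given by:
$$L_{\mathcal{A}}(w, D, D') = \log \frac{\Pr[\mathcal{A}(D) = w]}{\Pr[\mathcal{A}(D') = w]}.$$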
For notational simplicity, we omit the designation $\mathcal{A}$, i.e. we use $L(w, D, D')$ (or simply $L$) for the privacy loss random variable, and $p(w \mid D)$ and $p(w \mid D')$ for the outcome probability distributions for given datasets. Also, note that the privacy loss random variable $L$ is distributed by drawing $w \sim p(w \mid D)$ (see [14, Section 2.1 and Definition 3.1]), which helps link it to well-known divergences.
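The Gaussian noise mechanism achieving $(\varepsilon, \delta)$-DP, for a function $f: \mathcal{D} \to \mathbb{R}^m$, is defined as
$$\mathcal{A}(D) = f(D) + \mathcal{N}(0, \sigma^2 I),$$
where $\sigma > C \sqrt{2 \log(1.25 / \delta)} / \varepsilon$ and $C$ is the L2-sensitivity of $f$.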
For more details on differential privacy and the Gaussian mechanism, we refer the reader to .
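The Gaussian mechanism can be sketched in a few lines of code. The example below is a minimal illustration that assumes the classic calibration $\sigma = C\sqrt{2\log(1.25/\delta)}/\varepsilon$ for $0 < \varepsilon < 1$; the function names and the example query are hypothetical and not taken from the paper.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Boundary value of the classic noise calibration (assumes 0 < epsilon < 1).

    Any noise scale above this value satisfies the stated condition.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the classic calibration assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(values, sensitivity, epsilon, delta, rng=None):
    """Add i.i.d. Gaussian noise to every coordinate of a vector-valued query."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical query: a 3-dimensional statistic with L2-sensitivity 1.
noisy = gaussian_mechanism([0.2, -1.0, 3.5], sensitivity=1.0,
                           epsilon=0.5, delta=1e-5, rng=random.Random(0))
print(noisy)
```

Note how quickly the required noise grows as $\varepsilon$ shrinks: the scale is inversely proportional to $\varepsilon$ but only logarithmic in $1/\delta$.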
We will also need the definition of Rényi divergence:
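Rényi divergence of order $\lambda$ between distributions $P$ and $Q$, denoted as $\mathcal{D}_\lambda(P \| Q)$, is defined as
$$\mathcal{D}_\lambda(P \| Q) = \frac{1}{\lambda - 1} \log \int p(x) \left[\frac{p(x)}{q(x)}\right]^{\lambda - 1} dx,$$
where $p(x)$ and $q(x)$ are the corresponding density functions of $P$ and $Q$.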
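As a quick sanity check of this definition, the sketch below compares the well-known closed form for two Gaussians with equal variance, $\mathcal{D}_\lambda = \lambda(\mu_P - \mu_Q)^2 / (2\sigma^2)$, against direct numerical integration. The helper names and the integration range are illustrative assumptions.

```python
import math


def normal_pdf(mu, sigma):
    """Return the density function of N(mu, sigma^2)."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / norm


def renyi_gaussians(mu_p, mu_q, sigma, order):
    """Closed-form Renyi divergence between two equal-variance Gaussians."""
    return order * (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)


def renyi_numeric(p, q, order, lo, hi, steps=20000):
    """Renyi divergence by trapezoidal integration of p(x) * (p(x)/q(x))^(order-1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        px = p(x)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * px * (px / q(x)) ** (order - 1)
    return math.log(total * h) / (order - 1)


closed_form = renyi_gaussians(0.0, 1.0, 1.0, order=3.0)
numeric = renyi_numeric(normal_pdf(0.0, 1.0), normal_pdf(1.0, 1.0),
                        order=3.0, lo=-12.0, hi=12.0)
print(closed_form, numeric)
```

The closed form is what makes Gaussian noise convenient for accounting: the divergence depends on the data only through the squared distance between the two means.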
In the first part of the paper, while describing the concept of Bayesian differential privacy, we consider a general iterative learning algorithm, such that each iteration $t$ produces a non-private learning outcome $g^{(t)}$ (e.g. a gradient over a batch of data). In the second half of the paper, we consider the equivalent federated learning setting, where each communication round $t$ produces a set of non-private learning outcomes $u_i^{(t)}$, one for each client $i$.
The non-private outcome, in both scenarios, gets transformed into a private learning outcome $w^{(t)}$ that is used as a starting point for the next iteration or communication round. The learning outcome can be made private by different means, but in this work we consider the most common approach of applying an additive noise mechanism (e.g. a Gaussian noise mechanism). We denote the distribution of private outcomes by $p(w^{(t)} \mid w^{(t-1)}, D)$ (we assume the Markov property of the learning process for brevity of notation, but it is not necessary in general) or $p(w^{(t)} \mid w^{(t-1)}, u)$, depending on the scenario.
The process can run on subsamples of data or subsets of clients, in which case $w^{(t)}$ comes from the distribution $p(w^{(t)} \mid w^{(t-1)}, B^{(t)})$, where $B^{(t)}$ is a batch of data used for the parameter update in iteration $t$, or $p(w^{(t)} \mid w^{(t-1)}, u^{(t)})$, where $u^{(t)}$ is a set of updates from users participating in the communication round $t$. In these cases, privacy is amplified through sampling.
For each iteration, we would like to compute a quantity (we call it a privacy cost) that accumulates over the learning process and makes it possible to compute privacy loss bounds using concentration inequalities. The overall privacy accounting workflow does not significantly differ from prior work, and is in fact a generalisation of the well-known moments accountant. Importantly, it is not tied to a specific learning algorithm or a class of algorithms, as long as one can map it to the above setting.
Before we proceed, we find it important to motivate the research and usage of alternative definitions of privacy. The primary reason is that the complexity of the concept of differential privacy often leads to misunderstanding or overestimation of the guarantees it provides. While we do not fully tackle the problem of interpretability, we provide a simple example below that makes it easier to judge the quality of the provided guarantees.
Consider the state-of-the-art differentially private machine learning models [1, 30]. In order to come close to the non-private accuracy (say within of it), all of the reported models stretch their privacy budget $\varepsilon$ to (for a reasonably low $\delta$), while in many cases it goes up to . In real-world applications, it can be even larger (see https://www.wired.com/story/apple-differential-privacy-shortcomings/). These numbers seem small, and thus, may often be overlooked. But let us present an alternative interpretation.
What we are interested in is the change in the posterior distribution of the attacker after they see the private model compared to the prior [27, 8]. Let us consider the stronger, pure DP for simplicity. According to the definition of $\varepsilon$-DP, for any adjacent datasets $D, D'$ and any outcome $w$:
$$e^{-\varepsilon} \le \frac{p(w \mid D)}{p(w \mid D')} \le e^{\varepsilon}.$$
Assume the following specific example. The datasets consist of income values for residents of a small town. There is one individual $x'$ whose income is orders of magnitude higher than the rest, and whose residency in the town is what the attacker wishes to infer. The attacker observes the mean income $w$ sanitised by a differentially private mechanism with privacy parameter $\varepsilon$. If the individual is not present in the dataset, the probability of $w$ being above a certain threshold is extremely small. On the contrary, if $x'$ is present, this probability is higher (say it is equal to $r$). The attacker takes a Bayesian approach, computes the likelihood of the observed value under each of the two assumptions and the corresponding posteriors given a flat prior. The attacker then concludes that the individual is present in the dataset and is a resident.
By the above expression, $r$ can be at most $e^{\varepsilon}$ times larger than the corresponding probability without $x'$. But if the latter is small enough (that is, as low as $r e^{-\varepsilon}$), then the probability of the attacker's guess being correct is as high as $r / (r + r e^{-\varepsilon})$ or, equivalently,
$$\frac{1}{1 + e^{-\varepsilon}}.$$
To put it in perspective, for a DP algorithm with , the upper bound on the accuracy of this attack is as high as . For , it is . For , . Remember that we used an uninformative flat prior, and for a more informed attacker these numbers could be even larger.
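The bound is easy to evaluate numerically. The sketch below assumes the flat-prior argument above, under which the posterior probability of a correct guess is at most $1 / (1 + e^{-\varepsilon})$; the function name and the chosen values of $\varepsilon$ are illustrative and not taken from the paper.

```python
import math


def attack_accuracy_bound(epsilon):
    """Upper bound on the attacker's posterior under pure epsilon-DP and a flat prior."""
    return 1.0 / (1.0 + math.exp(-epsilon))


# Illustrative privacy budgets (assumed values, chosen only to show the trend).
for eps in (0.1, 1.0, 2.0, 5.0, 10.0):
    print(f"epsilon = {eps:>4}: bound = {attack_accuracy_bound(eps):.4%}")
```

At $\varepsilon = 0$ the bound equals one half, i.e. the attacker learns nothing beyond the prior, and it approaches one exponentially fast as $\varepsilon$ grows.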
In a more realistic scenario, even without any privacy protection, such high accuracy is not likely to be achieved by the attacker. Such a guarantee is therefore hardly better than no guarantee, and cannot be seen as reassuring. Thus, we want to encourage the discussion of and search for more meaningful privacy definitions or DP relaxations for machine learning and federated learning. We present and explore one such relaxation in this paper.
IV Bayesian Differential Privacy
In this section, we describe Bayesian differential privacy (BDP), accompanied by a practical privacy loss accounting method. We restate just the main results necessary for the following section, while all the details, proofs and experimental evaluation of BDP can be found in .
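A randomised function (algorithm) $\mathcal{A}: \mathcal{D} \to \mathcal{R}$ with domain $\mathcal{D}$ and range $\mathcal{R}$ satisfies $(\varepsilon_\mu, \delta_\mu)$-Bayesian differential privacy if for any two adjacent datasets $D, D' \in \mathcal{D}$, differing in a single data point $x' \sim \mu(x)$, and for any set of outcomes $\mathcal{S}$ the following holds:
$$\Pr[\mathcal{A}(D) \in \mathcal{S}] \le e^{\varepsilon_\mu} \Pr[\mathcal{A}(D') \in \mathcal{S}] + \delta_\mu.$$
To derive tighter sequential composition, we use an alternative definition that implies the above:
$$\Pr[L(w, D, D') \ge \varepsilon_\mu] \le \delta_\mu,$$
where probability is taken over the randomness of the outcome $w$ and the additional example $x'$.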
This definition is very close to the original definition of DP, except that it also takes into account the randomness of $x'$. Hence, the basic properties are similar to those of DP, although BDP does not provide guarantees in all the scenarios where DP does (e.g. when the distribution is non-stationary, it is possible that BDP underestimates the actual privacy loss).
While Definition 5 does not specify the distribution of any point in the dataset other than the additional example $x'$, it is natural and convenient to assume that all examples in the dataset are drawn from the same distribution $\mu(x)$. This holds in many real-world applications, including all applications evaluated in this paper, and it allows using sampling techniques instead of requiring knowledge of the true distribution.
We also assume that all data points are exchangeable, i.e. any permutation of data points has the same joint probability. This enables tighter accounting for iterative mechanisms, and is naturally satisfied in the considered scenarios.
Since basic composition is not enough to provide tight privacy bounds for iterative federated learning algorithms in which we are interested, let us present two theorems, generalising upon the moments accountant routine. We use it as a foundation for the privacy accounting framework for federated learning presented in the next section.
Theorem 1 (Advanced Composition).
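Let a learning algorithm run for $T$ iterations. Denote by $w^{(1)}, \ldots, w^{(T)}$ a sequence of private learning outcomes obtained at iterations $1, \ldots, T$, and by $L^{(1:T)}$ the corresponding total privacy loss. Then,
$$\mathbb{E}\left[e^{\lambda L^{(1:T)}}\right] \le \prod_{t=1}^{T} \mathbb{E}_x\left[e^{T \lambda \mathcal{D}_{\lambda+1}(p_t \| q_t)}\right]^{\frac{1}{T}},$$
where $p_t = p(w^{(t)} \mid w^{(t-1)}, D)$, $q_t = p(w^{(t)} \mid w^{(t-1)}, D')$, and $\mathcal{D}_{\lambda+1}(p_t \| q_t)$ is Rényi divergence between $p_t$ and $q_t$.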
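We denote the logarithm of the quantity inside the product in Theorem 1 as $c_t(\lambda, T)$ and call it the privacy cost of the iteration, or communication round, $t$:
$$c_t(\lambda, T) = \frac{1}{T} \log \mathbb{E}_x\left[e^{T \lambda \mathcal{D}_{\lambda+1}(p_t \| q_t)}\right].$$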
The privacy cost of the whole learning process is then a sum of the costs of each iteration.
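Theorem 2.
Let the algorithm produce a sequence of private learning outcomes $w^{(1)}, \ldots, w^{(T)}$ using a known probability distribution $p(w^{(t)} \mid w^{(t-1)}, D)$. Then, $\varepsilon_\mu$ and $\delta_\mu$ are related as
$$\log \delta_\mu \le \sum_{t=1}^{T} c_t(\lambda, T) - \lambda \varepsilon_\mu.$$
Under the conditions above, for a fixed $\delta_\mu$:
$$\varepsilon_\mu \le \frac{1}{\lambda} \sum_{t=1}^{T} c_t(\lambda, T) - \frac{1}{\lambda} \log \delta_\mu.$$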
Theorem 2 provides an efficient privacy accounting algorithm. During training, we compute the privacy cost for each iteration $t$, accumulate it, and then use the accumulated cost to compute the $(\varepsilon_\mu, \delta_\mu)$ pair. This process is conceptually close to that of the moments accountant, but accumulates a different quantity (note the change from the privacy loss random variable to Rényi divergence and expectation over data).
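The accounting loop can be sketched as follows. This is a minimal illustration, assuming the Chernoff-style relation $\log \delta_\mu \le \sum_t c_t(\lambda, T) - \lambda \varepsilon_\mu$ and a grid of integer orders $\lambda$; the class name, the grid, and the toy per-iteration cost (the closed form for a non-subsampled Gaussian with a fixed gradient distance) are assumptions made for the example, not the paper's implementation.

```python
import math


class BayesianAccountant:
    """Accumulates per-iteration privacy costs for several orders lambda."""

    def __init__(self, orders):
        self.orders = list(orders)
        self.total_cost = {lam: 0.0 for lam in self.orders}

    def accumulate(self, cost_fn):
        """Add the cost of one iteration; cost_fn maps an order to its cost."""
        for lam in self.orders:
            self.total_cost[lam] += cost_fn(lam)

    def epsilon(self, delta):
        """Tightest epsilon over the grid of orders for a fixed delta."""
        return min((cost - math.log(delta)) / lam
                   for lam, cost in self.total_cost.items())

    def delta(self, epsilon):
        """Tightest delta over the grid of orders for a fixed epsilon (capped at 1)."""
        best = min(cost - lam * epsilon for lam, cost in self.total_cost.items())
        return math.exp(min(best, 0.0))


def toy_cost(lam, dist_sq=1.0, sigma=20.0):
    """Assumed toy cost: lam * D_{lam+1} for two Gaussians at squared distance dist_sq."""
    return lam * (lam + 1) * dist_sq / (2.0 * sigma ** 2)


accountant = BayesianAccountant(orders=range(1, 33))
for _ in range(100):  # 100 training iterations
    accountant.accumulate(toy_cost)
eps = accountant.epsilon(delta=1e-5)
print(eps, accountant.delta(eps))
```

Tracking several orders at once matters because the best $\lambda$ is not known in advance: low orders pay a large $\log \delta_\mu / \lambda$ term, while high orders accumulate cost faster.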
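Computing $c_t(\lambda, T)$ precisely requires access to the prior distribution of data $\mu(x)$, which is unrealistic. Therefore, we need an estimator for $\mathbb{E}_x[e^{T \lambda \mathcal{D}_{\lambda+1}(p_t \| q_t)}]$. Moreover, since the Chernoff bound, on which Theorem 2 is based, only holds for the true expectation value, we have to take into account the estimator error. To solve this, we employ a Bayesian view of the estimation problem and use the upper confidence bound of the expectation estimator.
Let us define the following $m$-sample estimator of $c_t(\lambda, T)$:
$$\hat{c}_t(\lambda, T) = \frac{1}{T} \log\left[M(t) + \frac{F^{-1}(1 - \gamma, m - 1)}{\sqrt{m - 1}} S(t)\right],$$
where $M(t)$ and $S(t)$ are the sample mean and the sample standard deviation of $e^{T \lambda \hat{\mathcal{D}}^{(t)}_{\lambda+1}}$, $F^{-1}(1 - \gamma, m - 1)$ is the inverse of the Student's $t$-distribution CDF at $1 - \gamma$ with $m - 1$ degrees of freedom, and
$$\hat{\mathcal{D}}^{(t)}_{\lambda+1} = \max\left\{\mathcal{D}_{\lambda+1}(\hat{p}_t \| \hat{q}_t), \mathcal{D}_{\lambda+1}(\hat{q}_t \| \hat{p}_t)\right\},$$
with $\hat{p}_t$ and $\hat{q}_t$ being the outcome distributions computed on a sampled batch with and without a sampled example.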
One can show that, for continuous distributions $p_t$ and $q_t$, $\hat{c}_t(\lambda, T)$ overestimates the true privacy cost $c_t(\lambda, T)$ with probability $1 - \gamma$. Therefore, the probability of underestimation $\gamma$ can be fixed upfront and incorporated in $\delta_\mu$.
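A minimal sketch of such an upper-confidence-bound estimator is given below. It assumes the form "sample mean plus a Student's $t$ quantile times the sample standard deviation over $\sqrt{m - 1}$", takes already exponentiated samples as input, and expects the caller to supply the quantile (for instance from `scipy.stats.t.ppf(1 - gamma, m - 1)`). The function name, the `num_iterations` argument and the choice of the $m - 1$ normalisation for the standard deviation are assumptions of this example.

```python
import math
import statistics


def privacy_cost_upper_bound(samples, t_quantile, num_iterations=1):
    """Upper confidence bound on the privacy cost from m exponentiated samples.

    samples        -- sampled values of exp(T * lambda * D_hat), one per data sample
    t_quantile     -- inverse Student's t CDF at 1 - gamma with m - 1 degrees of freedom
    num_iterations -- the number of iterations T (assumed normalisation)
    """
    m = len(samples)
    if m < 2:
        raise ValueError("at least two samples are needed")
    mean = statistics.fmean(samples)
    # statistics.stdev divides by m - 1; this is the more conservative choice.
    std = statistics.stdev(samples)
    bound = mean + t_quantile / math.sqrt(m - 1) * std
    return math.log(bound) / num_iterations


# Hypothetical samples and a hypothetical quantile value, for illustration only.
samples = [1.02, 1.05, 1.01, 1.08, 1.03]
print(privacy_cost_upper_bound(samples, t_quantile=3.75))
```

With identical samples the standard deviation vanishes and the estimate reduces to the plain logarithm of the mean; the more the sampled divergences vary, the larger the safety margin becomes.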
This step changes the interpretation of $\delta_\mu$ in Bayesian differential privacy compared to the traditional DP. Apart from the probability of the privacy loss exceeding $\varepsilon_\mu$, e.g. in the tails of its distribution, it also incorporates our uncertainty about the true data distribution (in other words, the probability of underestimating the true expectation because of not observing enough data samples). It can be intuitively understood as accounting for unlikely or unobserved data in $\delta_\mu$, rather than in $\varepsilon_\mu$ by adding more noise.
For discrete outcomes, a different estimator needs to be derived to meet the same bound. The process of obtaining it is identical to the one above, with the only change in the maximum entropy distribution.
There are other differences compared to the classic DP, such as allowance for unbounded sensitivity or estimator privacy, that are not discussed in this work. More details can be found in .
Gaussian Noise Mechanism. Consider the subsampled Gaussian noise mechanism. The outcome distribution in this case is the mixture of two Gaussians $(1 - q) \mathcal{N}(g_t, \sigma^2) + q \mathcal{N}(g'_t, \sigma^2)$, where $g_t$ and $g'_t$ are non-private outcomes at the iteration $t$ (e.g. gradients), $\sigma$ is the noise parameter, and $q$ is the data sampling probability. Plugging the outcome distribution into the formula for Rényi divergence, we get the following result for the privacy cost.
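Theorem 3.
Given the Gaussian noise mechanism with the noise parameter $\sigma$ and subsampling probability $q$, the privacy cost for $\lambda \in \mathbb{N}$ at iteration $t$ can be expressed as
$$c_t(\lambda, T) = \max\{c_t^L(\lambda, T), c_t^R(\lambda, T)\},$$
where
$$c_t^L(\lambda, T) = \frac{1}{T} \log \mathbb{E}_x\left[\mathbb{E}_{k \sim B(\lambda+1, q)}\left[e^{\frac{k^2 - k}{2\sigma^2} \|g_t - g'_t\|^2}\right]^T\right],$$
$$c_t^R(\lambda, T) = \frac{1}{T} \log \mathbb{E}_x\left[\mathbb{E}_{k \sim B(\lambda, q)}\left[e^{\frac{k^2 + k}{2\sigma^2} \|g_t - g'_t\|^2}\right]^T\right],$$
and $B(\lambda, q)$ is the binomial distribution with $\lambda$ experiments and the probability of success $q$.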
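For a single pair of non-private outcomes, the inner binomial expectations of the subsampled Gaussian mechanism can be evaluated directly. The sketch below assumes the two directional terms $\log \mathbb{E}_{k \sim B(\lambda+1, q)}[e^{(k^2 - k)\Delta^2 / (2\sigma^2)}]$ and $\log \mathbb{E}_{k \sim B(\lambda, q)}[e^{(k^2 + k)\Delta^2 / (2\sigma^2)}]$, where $\Delta^2$ is the squared distance between the two outcomes; the function names are hypothetical, and the outer expectation over data is deliberately left out.

```python
import math


def binomial_expectation(n, q, fn):
    """E[fn(k)] for k ~ Binomial(n, q), computed by explicit summation."""
    return sum(math.comb(n, k) * q ** k * (1.0 - q) ** (n - k) * fn(k)
               for k in range(n + 1))


def left_cost(order, q, sigma, dist_sq):
    """Single-pair term with k ~ B(order + 1, q) and exponent (k^2 - k)."""
    scale = dist_sq / (2.0 * sigma ** 2)
    return math.log(binomial_expectation(
        order + 1, q, lambda k: math.exp((k * k - k) * scale)))


def right_cost(order, q, sigma, dist_sq):
    """Single-pair term with k ~ B(order, q) and exponent (k^2 + k)."""
    scale = dist_sq / (2.0 * sigma ** 2)
    return math.log(binomial_expectation(
        order, q, lambda k: math.exp((k * k + k) * scale)))


def single_pair_cost(order, q, sigma, dist_sq):
    """Maximum of the two directional terms for one pair of outcomes."""
    return max(left_cost(order, q, sigma, dist_sq),
               right_cost(order, q, sigma, dist_sq))


# Illustrative values: order 3, noise 2.0, squared gradient distance 1.0.
print(single_pair_cost(3, q=1.0, sigma=2.0, dist_sq=1.0))   # no subsampling
print(single_pair_cost(3, q=0.01, sigma=2.0, dist_sq=1.0))  # 1% sampling
```

Without subsampling ($q = 1$) both terms collapse to the plain Gaussian value $\lambda(\lambda + 1)\Delta^2 / (2\sigma^2)$, which makes the amplification effect of small $q$ easy to see.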
V Federated Learning with Bayesian Differential Privacy
In this section, we adapt the Bayesian differential privacy framework and its accounting method to guarantee client-level privacy, the level most frequently addressed in the literature. We then justify and explore instance-level privacy and two different techniques for accounting for it. Finally, we propose a method to jointly account for instance-level and client-level privacy for the FedSGD algorithm in order to provide the best trade-off between utility and privacy guarantees.
V-A Client privacy
When it comes to reinforcing federated learning with differential privacy, the foremost attention is given to the client-level privacy [24, 18]. The goal is to hide the presence of a single user, or to be more specific, to bound the influence of any single user on the learning outcome distribution (i.e. the distribution of the model parameters).
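Under the classic DP, this is typically enforced by clipping every user update to a fixed L2-norm threshold $C$ and then adding Gaussian noise with the variance $C^2 \sigma^2$. The noise parameter $\sigma$ is calibrated to bound the privacy loss in each communication round, and then the privacy loss is accumulated across the rounds using the moments accountant.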
We use the same privacy mechanism, but employ the Bayesian accounting method instead of the moments accountant. Intuitively, our accounting method should have a significant advantage over the moments accountant in settings where data is distributed similarly across the users, because in this case their updates would be in strong agreement. In order to map the Bayesian differential privacy framework to this setting, let us introduce some notation.
Let $K$ denote the number of clients in the federated learning system. Every client $i$ computes and sends to the server a model update $u_i$ drawn from the client's update distribution $p_i(u)$. Considering individual client distributions ensures that our approach is applicable to non-i.i.d. settings that are natural in the federated learning context. Generally, not all users participate in a given communication round. We denote the probability of a user participating in the round by $q$. Thus, the overall update distribution is given by a mixture:
In our experiments, we fix .
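To match the notation above, let $w^{(t)}$ indicate the privacy-preserving model update:
$$w^{(t)} = \frac{1}{|u^{(t)}|} \sum_{u_i \in u^{(t)}} u_i + \xi^{(t)},$$
where $\xi^{(t)}$ is zero-mean Gaussian noise in the case of the Gaussian mechanism, and $u^{(t)}$ is the set of updates from users participating in the round $t$.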
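One communication round of this mechanism can be sketched as follows. The example assumes that client updates are clipped to an L2-norm threshold, summed, perturbed with Gaussian noise of standard deviation equal to the threshold times the noise parameter, and then divided by the number of participants; the normalisation and all function names are assumptions of this sketch, since the text does not fix them.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale an update down so that its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    factor = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [x * factor for x in update]


def private_round(client_updates, participation_prob, clip_norm, sigma, rng):
    """Sample participants, clip their updates, add noise, and average."""
    participants = [u for u in client_updates if rng.random() < participation_prob]
    if not participants:
        return None  # nobody was sampled in this round
    clipped = [clip_update(u, clip_norm) for u in participants]
    dim = len(clipped[0])
    # Noise is added to the sum of clipped updates (assumed normalisation).
    noisy_sum = [sum(u[j] for u in clipped) + rng.gauss(0.0, clip_norm * sigma)
                 for j in range(dim)]
    return [s / len(participants) for s in noisy_sum]


# Hypothetical two-dimensional updates from four clients.
updates = [[3.0, 4.0], [0.3, 0.4], [-1.0, 2.0], [0.5, -0.5]]
print(private_round(updates, participation_prob=0.5, clip_norm=1.0,
                    sigma=1.0, rng=random.Random(0)))
```

Clipping bounds the influence of any single client on the sum, which is what allows a fixed noise scale to hide that client's presence.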
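To bound $\varepsilon_\mu$ and $\delta_\mu$ of Bayesian differential privacy, one needs to compute $c_t(\lambda, T) = \max\{c_t^L(\lambda, T), c_t^R(\lambda, T)\}$, where $c_t^L$ and $c_t^R$ are the privacy costs corresponding to the two directions of the Rényi divergence between the outcome distributions with and without a single client's update.
Since the randomness of $w^{(t)}$ comes from the subsampled Gaussian noise mechanism, we use Theorem 3, in combination with user sampling and the estimator (5) for both expressions, to obtain $\hat{c}_t(\lambda, T)$ that upper-bounds $c_t(\lambda, T)$ with high probability.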
V-B Instance privacy
As noted above, the same privacy mechanism can be used in conjunction with the moments accountant to get the classic DP guarantees [24, 18]. In this case, $(\varepsilon, \delta)$-DP at the client level implies the same guarantee at the instance level (i.e. bounding the influence of a single data point). However, this does not hold for BDP. Moreover, the same privacy guarantee may not be meaningful at the instance level. For example, a value of $\delta$ that is reasonable for clients is not a reasonable failure probability at the data point level if a client has tens of thousands of data points.
At the same time, instance privacy is extremely important in some scenarios. Imagine federated training on medical data from different hospitals: while a hospital's participation may be public knowledge, individual patients' data must be protected to the highest degree. Another reason for considering instance-level privacy is that it provides an additional layer of protection for users in case of an untrusted curator.
In order to get tighter instance privacy guarantees, we apply the subsampled Gaussian noise mechanism to gradient computation on user devices. The accounting follows the same procedure as described above, except that the noise parameter and the sampling probability may be different, depending on which of the settings described below is used.
There are two possible accounting schemes:
V-B1 Sequential accounting
Part of the accounting is performed locally on user devices and part on the server. The overall privacy cost is equivalent to that of centralised training with the data sampling probability $q = B / N$, where $N$ is the total number of data points across all users, and $B$ is the local batch size.
The process proceeds as follows. At each communication round, the server sends $N$ to participating clients; every client performs private gradient updates, computes the privacy cost estimate $\hat{c}_t$, and sends it to the server. The server then aggregates the sum of $\hat{c}_t$ from all users. Since the privacy costs are data-dependent, secure multi-party computation can be used to let the server learn the sum without learning individual costs.
The disadvantage of this method is that every participating client learns the total number of data points, and especially in the settings with a small number of users it may not be desirable. Furthermore, the obtained bounds apply to the commonly learnt model but not to the individual updates of each user, requiring them to maintain a separate local bound. These issues are addressed by parallel accounting.
V-B2 Parallel accounting
In this scheme, every client $i$ computes $\hat{c}_t$ using $q_i = B / N_i$, where $N_i$ is the local dataset size of the client. Consequently, since $q_i > q$, the privacy costs will be higher. But this is compensated by using parallel composition instead of sequential: the server aggregates the maximum of $\hat{c}_t$ over all users. Again, secure multi-party computation can be used to prevent the server from learning individual privacy costs.
Parallel composition is applicable in this scenario because user updates within a single round are independent. However, the server needs to sum up maximum privacy costs over the rounds because updates are dependent on previous rounds.
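The difference between the two schemes comes down to how per-client costs are aggregated within a round. The sketch below assumes that each client reports one privacy cost per round and ignores secure aggregation; the function names and the numbers are purely illustrative.

```python
def sequential_total(costs_per_round):
    """Sequential accounting: sum client costs within a round, then over rounds."""
    return sum(sum(round_costs) for round_costs in costs_per_round)


def parallel_total(costs_per_round):
    """Parallel accounting: take the maximum client cost per round, then sum over rounds."""
    return sum(max(round_costs) for round_costs in costs_per_round)


def sampling_probability(batch_size, dataset_size):
    """Data sampling probability for a given batch size and dataset size."""
    return batch_size / dataset_size


# Hypothetical costs reported by three clients over two rounds.
costs = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]]
print(sequential_total(costs), parallel_total(costs))

# Global versus local sampling probability for illustrative sizes.
print(sampling_probability(32, 60000), sampling_probability(32, 600))
```

The two totals are not directly comparable, because the individual costs in the parallel scheme are computed with the larger local sampling probability and are therefore higher to begin with.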
As we show in Section VI, parallel accounting may require more communication rounds to converge to the same quality solution with the same privacy guarantee. The gap is more notable on non-identically distributed data.
V-C Joint privacy
Instance privacy provides tighter and more meaningful guarantees for every data point contribution to the trained model. Nevertheless, there is a downside: adding noise both during on-device gradient descent as well as during the averaging phase on the server results in slow convergence or complete divergence of the federated learning algorithm.
To tackle this problem, we propose joint accounting, where the noise added on the client side is re-counted towards the client-level privacy guarantee.
The main idea of joint accounting is that a client update received by the server is already noisy when instance privacy is enforced, and instead of adding more noise the server can re-count existing noise to compute the client-level bound (a version of this technique for differential privacy has also been explored in a master's thesis project with Nikolaos Tatarakis). The only problem is that the server cannot sample non-private client updates to estimate the privacy cost, because it no longer has access to their distribution.
Fortunately, the inner expectation in the two privacy cost terms $c_t^L$ and $c_t^R$ can be computed locally, suggesting the following procedure. Every client computes the estimator (Eq. 5) with $p_t$ and $q_t$ being the private outcome distributions with and without their entire update. Then, the server obtains the aggregate estimates by simple averaging. Additionally, one can implement this averaging step with secure multi-party computation to further strengthen privacy protection. For the moment, however, joint accounting can only be used with FedSGD, and not FedAvg, because every noisy step in FedAvg would change the point at which the gradient is computed, potentially leading to a different gradient distribution or an underestimated total noise variance.
Joint accounting makes it possible to achieve tight instance and client privacy guarantees and, at the same time, to preserve the speed of convergence at almost the same level as the client-privacy-only solution (see Section VI-D).
In this section, we provide results of the experimental evaluation of our approach. We begin by describing the datasets we used, as well as the setting details shared by all experiments. The subsequent structure follows that of the previous section. We first evaluate client-level privacy by comparing accuracy and privacy guarantees of the traditional DP method to ours (Section VI-B). Then, in Section VI-C, we perform experiments on the two proposed methods of instance privacy accounting. Finally, Section VI-D describes the results of the joint accounting approach.
VI-A Experimental setup
We perform experiments on two datasets. The first is MNIST, a standard image classification benchmark widely used in machine learning research. More specifically, it is a handwritten digit recognition dataset consisting of 60000 training examples and 10000 test examples, where each example is a 28x28 greyscale image. The second is the APTOS 2019 Blindness Detection challenge dataset (https://www.kaggle.com/c/aptos2019-blindness-detection/overview/description); in figures, tables and text, we refer to this dataset as Retina or APTOS. It consists of 3662 retina images taken using fundus photography. The images are labelled by clinicians to reflect the severity of diabetic retinopathy on a scale from 0 to 4. Unlike other datasets commonly evaluated in the privacy literature [1, 24, 18], a privacy leak in this one has more serious implications.
All experiments have the following general setup. There is a number of clients (100, 1000, or 10000), each holding a subset of data, and a server that coordinates federated training of the shared model. Some setups with a higher number of users entail repetition of data, which is a natural scenario in some applications, e.g. shared or very similar images on different smartphones. In MNIST experiments, each user holds 600 examples. For the APTOS dataset, we use data augmentation techniques (e.g. random cropping, resizing, etc.) to obtain a larger training set, and then split it such that every client gets 350 images.
Testing is performed on the official test split for MNIST, and on the first 500 samples in case of APTOS.
While the parameters of the training vary based on experiments, the models remain the same. For MNIST, we use a simple CNN with two convolutional layers and two fully connected layers (similar to the one described in the TensorFlow tutorial, https://www.tensorflow.org/tutorials/images/deep_cnn). In the case of APTOS, due to the small dataset size and a harder learning task, we employ ResNet-50 pre-trained on ImageNet and re-train only the last fully-connected layer of the network. We do not do extensive hyper-parameter tuning in general, since we are interested in the relative performance of private models compared to non-private ones rather than the best classification accuracy, and thus, our non-private baseline results may not always match previously reported ones. For the same reason, we restrict the number of communication rounds and use FedSGD instead of FedAvg, although all the methods except for joint accounting are compatible with FedAvg.
One of the important aspects of federated learning is that data might not be distributed identically among users. In agreement with previous work [23, 18], we include experiments in both i.i.d. and non-i.i.d. settings for MNIST, because it allows for a natural non-identical split. More specifically, in the i.i.d. setting, every user is assigned a subset of uniformly sampled examples. In the non-i.i.d. setting, we follow the same scheme as prior work: splitting the dataset into shards of 300 points within the same class and then assigning 2 random shards to each client. The scenario of 100 clients with non-identically distributed data is particularly hard for privacy applications: there are possible digit combinations that clients can hold and only 100 clients, meaning that some clients might be easily distinguishable by their data distribution. Therefore, it is important to note that it may not be possible to obtain a reasonable privacy bound in this scenario without seriously compromising accuracy.
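The non-i.i.d. split can be sketched as follows. The example assumes that examples are sorted by label, cut into consecutive shards of 300 indices, and that each client receives 2 randomly chosen shards; the function name and the synthetic labels are illustrative, and with real MNIST labels a shard may straddle two classes because class sizes are not multiples of 300.

```python
import random


def shard_split(labels, num_clients, shard_size=300, shards_per_client=2, seed=0):
    """Assign label-sorted shards of example indices to clients at random."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shards = [order[s:s + shard_size] for s in range(0, len(order), shard_size)]
    needed = num_clients * shards_per_client
    if len(shards) < needed:
        raise ValueError("not enough shards for the requested number of clients")
    rng = random.Random(seed)
    rng.shuffle(shards)
    clients = []
    for c in range(num_clients):
        own = shards[c * shards_per_client:(c + 1) * shards_per_client]
        clients.append([idx for shard in own for idx in shard])
    return clients


# Synthetic stand-in for labels: 10 classes with 600 examples each.
labels = [digit for digit in range(10) for _ in range(600)]
clients = shard_split(labels, num_clients=10)
print([sorted({labels[i] for i in client}) for client in clients])
```

Each client ends up with 600 examples from at most two digit classes, which is what makes individual clients potentially distinguishable by their data distribution.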
The privacy accounting is performed by two methods. To obtain the bounds on $\varepsilon$ and $\delta$ of differential privacy, we use the moments accountant [1], the state-of-the-art DP accounting method. In the case of Bayesian differential privacy, we follow the technique described in Sections IV and V: we sample a number of user updates (or gradients for instance privacy), estimate an upper bound on the privacy cost, and use the Chernoff inequality to compute the corresponding pair $(\varepsilon_\mu, \delta_\mu)$.
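For the classic DP side, the conversion from accumulated log moments to an $(\varepsilon, \delta)$ pair can be illustrated with the standard tail bound $\delta = \min_\lambda \exp(\alpha(\lambda) - \lambda\varepsilon)$. The sketch below is only an illustration under simplifying assumptions: it uses the closed-form log moment of a plain Gaussian mechanism with unit sensitivity and no subsampling, and the function names and the set of orders are hypothetical. It does not reproduce the Bayesian estimator of Sections IV and V.

```python
import math


def gaussian_log_moment(lam, sigma, steps):
    """Log moment of order `lam` for `steps` compositions of a Gaussian
    mechanism with unit sensitivity and noise std `sigma` (no subsampling).
    Log moments of composed mechanisms add up, hence the factor `steps`."""
    return steps * lam * (lam + 1) / (2.0 * sigma ** 2)


def epsilon_from_log_moments(log_moments, delta):
    """Smallest epsilon implied by the tail bound for a target delta.
    `log_moments` is an iterable of (lam, alpha) pairs."""
    return min((alpha + math.log(1.0 / delta)) / lam for lam, alpha in log_moments)


orders = range(1, 65)  # assumed grid of integer orders
moments = [(lam, gaussian_log_moment(lam, sigma=4.0, steps=10)) for lam in orders]
eps = epsilon_from_log_moments(moments, delta=1e-5)
```

Because the bound is minimised over the order, evaluating more orders can only tighten the reported $\varepsilon$.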
VI-B Client privacy
In this experiment, we add client privacy in the same way as in [24] and [18]. We use Bayesian accounting, as described in Section V-A, and compare it to the classic differential privacy accounting by the moments accountant [1]. We fix the noise level and account DP and BDP in parallel.
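A common recipe for client-level privacy in the cited line of work is to clip each client update to a fixed norm, average the clipped updates, and add Gaussian noise scaled to the clipping norm. The sketch below illustrates that recipe only; it is not necessarily the exact procedure used in these experiments, and the names `clip_update`, `private_aggregate`, `clip_norm` and `noise_multiplier` are assumptions introduced for the example.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale the update down so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def private_aggregate(updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates and add Gaussian noise on the server."""
    m = len(updates)
    clipped = [clip_update(u, clip_norm) for u in updates]
    dim = len(updates[0])
    mean = [sum(u[j] for u in clipped) / m for j in range(dim)]
    # One client changes the mean by at most clip_norm / m, so the noise
    # standard deviation is expressed relative to that sensitivity.
    std = noise_multiplier * clip_norm / m
    return [v + rng.gauss(0.0, std) for v in mean]


rng = random.Random(0)
noisy_mean = private_aggregate([[3.0, 4.0], [0.1, 0.0]], clip_norm=1.0,
                               noise_multiplier=1.0, rng=rng)
```

Since the noise is fixed by `noise_multiplier`, the same noisy updates can be fed to both accountants, which is what accounting DP and BDP in parallel amounts to.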
Tables I, II and III summarise the accuracy and privacy guarantees obtained in this setting for MNIST (non-i.i.d. and i.i.d.) and APTOS, respectively. The first column indicates the number of clients, and the second gives the baseline accuracy of a non-private federated classifier (models described in the previous section). The following columns contain the accuracy and privacy parameters obtained for private models using the classic DP and BDP. Despite being trained in parallel, the two techniques may differ in accuracy because in some cases we stop training early for DP to avoid exceeding the privacy budget.
In all cases and for all datasets, we observe substantial benefits of using Bayesian accounting. The accuracy gains are most notable in the non-i.i.d. setting of MNIST, where our method can achieve higher accuracy in the 100 clients setting, because this setting presents a more difficult learning scenario, as explained in the previous section. The privacy gains are consistently significant across all datasets and settings, and, taking into account the fact that $\varepsilon$ is exponentiated to get the bound on outcome probability ratios, BDP can reach times stronger guarantee. Nevertheless, in the settings with few clients, even Bayesian differential privacy does not reach a more comfortable guarantee of , suggesting that a better privacy-accuracy trade-off may not be feasible due to higher client identifiability, or that more work is needed in improving training with noise and developing novel privacy mechanisms for federated learning.
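To make the effect of exponentiation explicit, recall the standard $(\varepsilon, \delta)$-DP inequality and compare the multiplicative factors obtained from two different budgets. The labels $\varepsilon_{\mathrm{DP}}$ and $\varepsilon_{\mathrm{BDP}}$ are introduced here for illustration only, and the comparison ignores the difference in how the two definitions treat $\delta$ and the data distribution.

```latex
\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta,
\qquad
\frac{e^{\varepsilon_{\mathrm{DP}}}}{e^{\varepsilon_{\mathrm{BDP}}}}
  = e^{\varepsilon_{\mathrm{DP}} - \varepsilon_{\mathrm{BDP}}}.
```

For instance, a difference of $2$ between the two budgets already corresponds to a factor of $e^{2} \approx 7.39$ in the bound on the probability ratio, so modest-looking reductions in $\varepsilon$ translate into much stronger guarantees.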
Importantly, there is no computation or communication overhead from the users’ point of view in these experiments since the privacy accounting code is executed on the server.
VI-C Instance privacy
As noted in Section V-B, instance privacy is very important in scenarios like collecting medical data from a number of hospitals where patient privacy is at least as crucial as hospital privacy. In this section, we compare two accounting methods proposed earlier: sequential and parallel accounting.
Figure 1 shows how the $\varepsilon$ estimate grows with communication rounds. We subtracted the initial value and applied a logarithmic scale in order to better show the difference in the rate of growth. Across all settings, parallel accounting leads to faster growth rates, despite the fact that parallel composition is more efficient (taking the maximum over clients instead of a sum). This behaviour can be explained by the fact that each client is unaware of the total dataset size and, having a small number of data points, is convinced that every data example has a significant influence on the outcome. The lack of awareness of other clients in the case of parallel accounting can also explain why we do not observe any improvement in the growth rate as the number of clients increases. The only exception is the non-i.i.d. MNIST experiment, where the difference is likely to come from increased stability of training and decreased gradient variability with more clients.
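The two effects discussed above can be illustrated with a toy calculation. The first two helpers show the textbook composition rules (budgets add up for mechanisms applied to the same data, while the maximum applies to disjoint data); the third contrasts the sampling ratio a client sees locally with the ratio relative to the whole federated dataset. All names and numbers are hypothetical, and the sketch does not show how the ratio enters the actual accountant.

```python
def sequential_composition(epsilons):
    """Mechanisms applied to the same data: the budgets add up."""
    return sum(epsilons)


def parallel_composition(epsilons):
    """Mechanisms applied to disjoint data: the worst budget dominates."""
    return max(epsilons)


def sampling_ratio(batch_size, dataset_size):
    """Fraction of the dataset touched by one batch."""
    return batch_size / dataset_size


# Assumed numbers: a client holding 600 of 60000 examples, batches of 32.
local_view = sampling_ratio(32, 600)      # client unaware of other clients
global_view = sampling_ratio(32, 60000)   # client told the total dataset size
```

The local ratio is a hundred times larger than the global one in this example, which matches the intuition that an isolated client attributes far more influence to each of its examples.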
The main takeaway from this experiment is that it is beneficial to use sequential accounting for privacy of the federated model whenever communicating the total size of the dataset to users is acceptable. In other cases, and for personal privacy accounting in the case of an untrusted curator, parallel accounting can be used, but more noise is necessary for reasonable privacy guarantees.
VI-D Joint privacy
Lastly, we test the proposed joint accounting of instance-level and client-level privacy and contrast it with accounting at these two levels separately. We perform experiments in the same settings as above, fixing the client privacy at a certain level () and evaluating the speed and quality of training. We also compare to what can be achieved by introducing privacy only at the client level.
Figure 2 displays the test accuracy evolution over communication rounds in the setup of 100 clients for APTOS and 1000 clients for MNIST. The graphs contain curves for training without privacy, client-level-only privacy, and the two accounting paradigms: joint and split. As expected, the non-private training achieves the best accuracy. Nevertheless, in the i.i.d. setting, client-only private training quickly approaches non-private training in quality. Notably, training with both instance and client privacy using the joint accounting performs nearly as well, while training with the separate accounting completely fails due to excessive amounts of noise at both instance and client levels. For the non-i.i.d. setting, private training is slower, but there is little difference between introducing privacy only at the client level and using the joint accounting at both levels: after a slightly larger number of rounds, training with the joint accounting reaches similar performance. Based on these experiments, we conclude that by using joint accounting we can introduce instance privacy on clients and get client-level privacy at almost no cost.
Finally, we evaluate our method in the strong privacy setting. We set the instance-level privacy to and stop training when the client privacy reaches a level similar to previous experiments (except for the APTOS dataset, where we were able to achieve comparable results with lower privacy cost), and report the accuracy that can be achieved in this strict setting. We have also chosen the most difficult scenario of 100 clients. As seen in Table IV, the algorithm with differential privacy performs very poorly on the APTOS dataset and fails to learn on MNIST, in both the i.i.d. and non-i.i.d. settings. Its performance is especially affected by the strict instance privacy requirement, since such low levels of $\varepsilon$ necessitate large quantities of noise to be added. It is worth noting that it might be possible to achieve better results with DP by performing per-example gradient clipping, as in [1], but we do not use this technique due to its impracticality.
On the other hand, our approach manages to achieve reasonable accuracy even under such a strong privacy guarantee. On the APTOS dataset, it is just lower than the non-private baseline, while on MNIST, it correctly classifies more than of the test data in the i.i.d. setting and over in the non-i.i.d. setting. One could potentially add more noise on the server and combine the accounting with the instance-level noise to slow down the growth of $\varepsilon$ and reach even better performance, but we leave these experiments for future work.
Both instance and joint privacy accounting add some computation overhead on user devices due to multiple gradient calculations. However, performing FL routines when devices are idle and charging, as suggested in [5], alleviates this problem. Communication overhead is negligible because only a single floating point number is added to user messages.
We employ the notion of $(\varepsilon_\mu, \delta_\mu)$-Bayesian differential privacy, a relaxation of $(\varepsilon, \delta)$-differential privacy, to obtain tighter privacy guarantees for clients in federated learning settings. The main idea of this approach is to utilise the fact that users come from a certain population with similarly distributed data, and therefore, their updates will likely be in agreement with each other. This is a meaningful assumption in many machine learning scenarios because they target a specific type of data (e.g. medical images, emails, motion sensor data, etc.). For example, it may be unjustified to try hiding the absence of an audio record in a training set for ECG analysis, since the probability of it appearing is in fact much smaller than $\delta$.
We adapt an efficient and tight privacy accounting method for Bayesian differential privacy to the federated setting in order to estimate client privacy guarantees. Moreover, we emphasise the importance of instance-level privacy and propose two variants of privacy accounting at this level. Finally, we introduce a novel technique of joint accounting suitable for obtaining privacy guarantees at instance and client levels jointly from only instance-level noise.
Our evaluation provides evidence that Bayesian differential privacy is more appropriate for federated learning. First, it requires significantly less noise to reach the same privacy guarantees, allowing models to train in fewer communication rounds. Second, the bounds on the privacy budget are much tighter, and thus, more meaningful. When the number of clients reaches the order of thousands, which is realistic in many federated learning scenarios, $\varepsilon$ can be kept below . Finally, we demonstrate that by using joint accounting we can get client privacy for free when adding instance privacy. This way, the privacy budget can be kept close to for client privacy and for instance privacy while maintaining reasonably high accuracy.
An important future direction of research is automatically detecting and mitigating scenarios in which tighter privacy guarantees are inapplicable, such as non-stationary data distributions or datasets with non-exchangeable samples.
[1] (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
[2] Differential privacy applications to Bayesian and linear mixed model estimation. Journal of Privacy and Confidentiality 5 (1), pp. 4.
[3] (1985) Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII—1983, pp. 1–198.
[4] (2018) Privacy amplification by subsampling: tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems, pp. 6280–6290.
[5] (2019) Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046.
[6] Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 74–86.
[7] (2016) Concentrated differential privacy: simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635–658.
[8] (2017) A teaser for differential privacy.
[9] (2017) On the meaning and limits of empirical differential privacy. Journal of Privacy and Confidentiality 7 (3), pp. 3.
[10] (2011) Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (Mar), pp. 1069–1109.
[11] (2009) ImageNet: a large-scale hierarchical image database. pp. 248–255.
[12] (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284.
[13] (2014) The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9 (3–4), pp. 211–407.
[14] (2016) Concentrated differential privacy. arXiv preprint arXiv:1603.01887.
[15] (2006-07) Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), Vol. 4052, Venice, Italy, pp. 1–12.
[16] (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333.
[17] (2017) Renyi differential privacy mechanisms for posterior sampling. In Advances in Neural Information Processing Systems, pp. 5289–5298.
[18] (2017) Differentially private federated learning: a client level perspective. arXiv preprint arXiv:1712.07557.
[19] (2013) Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences 249, pp. 124–131.
[20] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[21] (2017) Deep models under the GAN: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 603–618.
[22] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[23] (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
[24] (2017) Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963.
[25] (2012) Information-theoretic foundations of differential privacy. In International Symposium on Foundations and Practice of Security, pp. 374–381.
[26] (2009) Computational differential privacy. In Annual International Cryptology Conference, pp. 126–142.
[27] (2017) Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th, pp. 263–275.
[28] (2006) A Bayesian perspective on estimating mean, variance, and standard-deviation from data.
[29] (2016) Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755.
[30] (2018) Scalable private learning with PATE. arXiv preprint arXiv:1802.08908.
[31] (2015) A new method for protecting interrelated time series with Bayesian prior distributions and synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society) 178 (4), pp. 963–975.
[32] (2015) Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321.
[33] (2017) Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 3–18.
[34] (2013) Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pp. 245–248.
[35] (2019) Bayesian differential privacy for machine learning. arXiv preprint arXiv:1901.09697.
[36] (2019) Generating artificial data for private deep learning. In Proceedings of the PAL: Privacy-Enhancing Artificial Intelligence and Language Technologies, AAAI Spring Symposium Series, CEUR Workshop Proceedings, pp. 33–40.
[37] Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60 (7), pp. 3797–3820.
[38] (2016) On the relation between identifiability, differential privacy, and mutual-information privacy. IEEE Transactions on Information Theory 62 (9), pp. 5018–5029.
[39] (1982) Protocols for secure computations. In FOCS, Vol. 82, pp. 160–164.