Differentially-private Federated Neural Architecture Search

06/16/2020 · by Ishika Singh, et al.

Neural architecture search, which aims to automatically search for architectures (e.g., convolution, max pooling) of neural networks that maximize validation performance, has achieved remarkable progress recently. In many application scenarios, several parties would like to collaboratively search for a shared neural architecture by leveraging data from all parties. However, due to privacy concerns, no party wants its data to be seen by other parties. To address this problem, we propose federated neural architecture search (FNAS), where different parties collectively search for a differentiable architecture by exchanging gradients of architecture variables without exposing their data to other parties. To further preserve privacy, we study differentially-private FNAS (DP-FNAS), which adds random noise to the gradients of architecture variables. We provide theoretical guarantees of DP-FNAS in achieving differential privacy. Experiments show that DP-FNAS can search highly-performant neural architectures while protecting the privacy of individual parties. The code is available at https://github.com/UCSD-AI4H/DP-FNAS




1 Introduction

*Equal contribution. The work was done during an internship at UCSD.

In many application scenarios, data owners would like to train machine learning (ML) models on data that contains sensitive information, but the size of each dataset is limited. Many ML methods, especially deep learning methods, are data hungry: more training data usually improves performance. One way to obtain more training data is to combine data of the same kind from multiple parties and use the combined data to collectively train a model. However, since each of these datasets contains private information, they cannot be shared across parties. Federated learning (konevcny2016federated; mcmahan2016communication) was developed to address this problem: multiple parties collectively train a shared model in a decentralized way by exchanging sufficient statistics (e.g., gradients) without exposing the data of one party to another.

While preserving privacy by avoiding sharing data among different parties, federated learning (FL) makes model design difficult. When ML experts design a model architecture, they need to thoroughly analyze the properties of the data to obtain insights that are crucial in determining which architecture to use. In an FL setting, an expert from one party can only see the data from that party and is not able to analyze the data from other parties. Without a global picture of all data from the different parties, ML experts are not well-equipped to design a model architecture that is optimal for fulfilling the predictive tasks of all parties. To address this problem, we resort to automated neural architecture search (zoph2016neural; liu2018darts; real2019regularized), designing search algorithms that automatically find the architecture yielding the best performance on the validation datasets.

To this end, we study federated neural architecture search (FNAS), where multiple parties collaboratively search for an optimal neural architecture without exchanging sensitive data with each other, for the sake of preserving privacy. For computational efficiency, we adopt a differentiable search strategy (liu2018darts). The search space is overparameterized by a large set of candidate operations (e.g., convolution, max pooling) applied to intermediate representations (e.g., feature maps in a CNN). Each operation is associated with an architecture variable indicating how important this operation is. The prediction loss is a continuous function w.r.t. the architecture variables A as well as the weight parameters W of the individual operations. A and W are learned by minimizing the validation loss using a gradient descent algorithm. After learning, the operations with the top-k largest architecture variables are retained to form the final architecture.

In FNAS, a server maintains the global state of W and A. Each party has a local copy of W and A. In each iteration of the search algorithm, each party calculates gradient updates of W and A based on its local data and local parameter copy, then sends the gradients to the server. The server aggregates the gradients received from the different parties, performs a gradient descent update of the global state of W and A, and sends the updated parameters back to each party, which replaces its local copy with the newly received global parameters. This procedure iterates until convergence.

Avoiding exposing data is not sufficient for privacy preservation. Several studies (bhowmick2018protection; carlini2018secret; fredrikson2015model) have shown that intermediate results such as gradients can reveal private information. To address this problem, we study differentially-private FNAS (DP-FNAS), which adds random noise to the gradients of the architecture variables and weight parameters to achieve differential privacy (dwork2006calibrating). We provide theoretical guarantees of DP-FNAS in privacy preservation. Experiments demonstrate that while protecting the privacy of individual parties, the architectures searched by DP-FNAS can achieve accuracy comparable to those searched by single-party NAS.

The major contributions of this paper are as follows:


  • We propose differentially-private federated neural architecture search (DP-FNAS), which enables multiple parties to collaboratively search for a highly-performant neural architecture without sacrificing privacy.

  • We propose a DP-FNAS algorithm which uses a parameter server framework and a gradient-based method to perform federated search of neural architectures. The gradient is obfuscated with random noise to achieve differential privacy.

  • We provide a theoretical guarantee of our algorithm in terms of privacy preservation.

  • We perform experiments which show that DP-FNAS can search highly-performant neural architectures while protecting the privacy of individual parties.

The rest of the paper is organized as follows. Section 2 presents the methods and Section 3 provides a theoretical analysis of the privacy guarantees. Section 4 presents experiments. Section 5 reviews related works and Section 6 concludes the paper.

2 Methods

We assume there are K parties aiming to solve the same predictive task, e.g., predicting whether a patient has pneumonia based on his or her chest X-ray image. Each party has a labeled dataset containing pairs of an input data example and its label. For instance, the data example could be a chest X-ray and the label indicates whether the patient has pneumonia. The datasets contain sensitive information whose privacy needs to be strongly protected; therefore, the parties cannot share their datasets with each other. A naive approach is for each party to train a model using its own data. However, deep learning methods are data hungry: more training data usually leads to better predictive performance. It is preferable to leverage all datasets from the different parties to collectively train a model, which presumably has better predictive performance than the individual models of different parties, each trained on a party-specific dataset. How can one achieve this goal without sharing data between parties?

Federated learning (FL) (konevcny2016federated; mcmahan2016communication) is a learning paradigm designed to address this challenge. In FL, different parties collectively train a model by exchanging sufficient statistics (e.g., gradients) calculated from their datasets, instead of exchanging the original data directly. A server maintains the weight parameters W of the global model to be trained. Each party has a local copy of the model. In each iteration of the training algorithm, each party uses its data and its local model to calculate the gradient g_k of the predictive loss function with respect to W. Then it sends g_k to the server. The server aggregates the gradients received from the different parties and performs a gradient descent update of the global model: W ← W − η · (1/K) Σ_{k=1}^{K} g_k, where η is the learning rate. Then it sends the updated global model back to each party, which replaces its local model with the global one. This procedure iterates until convergence. In this process, the dataset of each party is not exposed to any other party or to the server; hence its privacy can be protected to some extent (later, we will discuss a stronger way of protecting privacy).
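The FL update loop above can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation: the squared-error loss and the parties' scalar datasets are hypothetical stand-ins for the real model and data.

```python
def local_gradient(w, data):
    # Each party computes the gradient of a toy loss (1/2)*(w - x)^2
    # averaged over its private examples x; only this gradient is shared.
    return sum(w - x for x in data) / len(data)

def federated_sgd(party_data, w0=0.0, lr=0.1, iters=200):
    """One global model w; each round, every party sends its gradient,
    the server averages them, takes a gradient step, and broadcasts w back."""
    w = w0
    for _ in range(iters):
        grads = [local_gradient(w, d) for d in party_data]   # parties -> server
        w -= lr * sum(grads) / len(grads)                    # server update
        # the server broadcasts w; parties overwrite their local copies
    return w

parties = [[1.0, 2.0], [3.0], [4.0, 6.0]]  # three parties' private datasets
w = federated_sgd(parties)
```

With this toy loss, the procedure converges to the minimizer of the average of the parties' local losses, i.e., the mean of the per-party means, without any raw example leaving its party.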

Though FL provides a nice way of effectively using more data for model training while preserving privacy, it poses difficulties for model architecture design. To design an effective model architecture, ML experts need to thoroughly analyze the properties of the data. In an FL setting, the expert from each party can only see the data from that party, not from others. Without a global picture of all datasets, these experts are not well-equipped to design an architecture that is optimal for the tasks of all parties.

To address this problem, we resort to automatic neural architecture search (NAS) (zoph2016neural; liu2018darts; real2019regularized). Given a predictive task and labeled data, NAS aims to automatically search for the optimal neural architecture that best fulfills the targeted task. The problem can be formulated in the following way:

min_A L(A, W*(A); D_val)   s.t.   W*(A) = argmin_W L(A, W; D_tr)     (1)

where D_tr and D_val are the training data and validation data respectively, A denotes the neural architecture, and W denotes the weights of the model whose architecture is A. Given a configuration A of the architecture, we train it on the training data and obtain the best weights W*(A). Then we measure the loss of the trained model on the validation set. The goal of an NAS algorithm is to identify the best A that yields the lowest validation loss. Existing search algorithms are mostly based on reinforcement learning (zoph2016neural), evolutionary algorithms (real2019regularized), and differentiable NAS (liu2018darts). In this work, we focus on differentiable NAS since it is computationally efficient.

To this end, we introduce federated neural architecture search (FNAS), which aims to leverage the datasets from different parties to collectively learn a neural architecture that can optimally perform the predictive task, without sharing privacy-sensitive data between these parties. The FNAS problem can be formulated as:

min_A Σ_{k=1}^{K} L(A, W*(A); D_val^(k))   s.t.   W*(A) = argmin_W Σ_{k=1}^{K} L(A, W; D_tr^(k))     (2)

where D_tr^(k) and D_val^(k) denote the training and validation datasets belonging to the k-th party respectively. A naive algorithm for FNAS performs the following steps iteratively: given a configuration A of the architecture, use the gradient-based FL method to learn the optimal weights W*(A) on the training data; then evaluate W*(A) on the validation data of each party and aggregate the evaluation results. The validation performance is used to select the best architecture. Clearly, this is neither efficient nor scalable. We instead resort to a differentiable search approach (liu2018darts). The basic idea of differentiable NAS is: set up an overparameterized network that combines many different types of operations; associate each operation with an architecture variable (AV) indicating how important the operation is; optimize these AVs together with the weight parameters of the operations to achieve the best performance on the validation set; then select the operations with the top-k largest AVs to form the final architecture. A neural architecture can be represented as a directed acyclic graph (DAG) whose nodes represent intermediate representations (e.g., feature maps in a CNN) and whose edges represent operations (e.g., convolution, pooling) over nodes. Each node is calculated as x_j = Σ_{i ∈ P(j)} o_{ij}(x_i), where P(j) is the set containing the ancestor nodes of x_j and o_{ij} denotes the operation associated with the edge connecting x_i to x_j. In differentiable NAS, this DAG is overparameterized: the operation on each edge is a weighted combination of all possible operations. Namely, o_{ij}(x) = Σ_{m=1}^{M} softmax(α_{ij})_m · o_m(x), where o_m is the m-th candidate operation (parameterized by a set of weights), M is the total number of operations, and the architecture variable α_{ijm} represents how important o_m is. In the end, the prediction function of this neural network is a continuous function parameterized by the architecture variables A and the weight parameters W. The prediction loss is end-to-end differentiable w.r.t. both A and W, which can therefore be learned by gradient descent.
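The weighted combination of candidate operations can be sketched as follows. This is a toy, framework-free illustration on scalar inputs (real implementations operate on tensors); the three candidate operations are hypothetical stand-ins for convolution, zero, etc.

```python
import math

def softmax(alphas):
    exps = [math.exp(a) for a in alphas]
    z = sum(exps)
    return [e / z for e in exps]

def mixed_op(x, candidate_ops, alphas):
    """Continuous relaxation of the choice among candidate operations:
    o(x) = sum_m softmax(alpha)_m * o_m(x)."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, candidate_ops))

# Toy candidate operations on a scalar "feature"
ops = [lambda x: x,          # identity
       lambda x: 0.0,        # zero
       lambda x: 2.0 * x]    # a "heavier" transform
alphas = [0.0, 0.0, 0.0]     # equal importance -> weights 1/3 each
y = mixed_op(1.0, ops, alphas)

# After search, the operation with the largest alpha is retained.
best = max(range(len(alphas)), key=lambda m: alphas[m])
```

Because the mixture weights are a softmax over the architecture variables, the output is differentiable w.r.t. the alphas, which is what makes gradient-based architecture search possible.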
After learning, the operations with the top-k largest architecture variables are retained to form the final architecture. The problem in Eq. (2) can be approximately solved by iteratively performing the following two steps:


  • Update the weight parameters W:

    W ← W − ξ ∇_W L_tr(W, A)     (3)

  • Update the architecture variables A:

    A ← A − η ∇_A L_val(W − ξ ∇_W L_tr(W, A), A)     (4)

    where the gradient ∇_A L_val(W − ξ ∇_W L_tr(W, A), A) can be approximately computed as

    ∇_A L_val(W′, A) − (ξ / (2ε)) [∇_A L_tr(W⁺, A) − ∇_A L_tr(W⁻, A)]     (5)

    where W′ = W − ξ ∇_W L_tr(W, A), W± = W ± ε ∇_{W′} L_val(W′, A), and ε is a small scalar.
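The finite-difference approximation of the architecture gradient can be checked on a toy problem with scalar W and A. This is a hedged sketch: the quadratic losses below are hypothetical stand-ins for the real training/validation losses, and numerical differentiation stands in for backpropagation.

```python
def grad(f, x, h=1e-5):
    # Central-difference derivative, used only to keep the sketch
    # framework-free (exact up to float error for quadratics).
    return (f(x + h) - f(x - h)) / (2 * h)

def darts_arch_grad(L_tr, L_val, W, A, xi=0.01, eps=1e-3):
    """Approximate d/dA of L_val(W - xi * dL_tr/dW, A) using the
    finite-difference rule of Eq. (5)."""
    gW_tr = grad(lambda w: L_tr(w, A), W)
    W_prime = W - xi * gW_tr                       # one-step unrolled weights
    gA_val = grad(lambda a: L_val(W_prime, a), A)  # first term of Eq. (5)
    gWp_val = grad(lambda w: L_val(w, A), W_prime)
    W_plus, W_minus = W + eps * gWp_val, W - eps * gWp_val
    second = (grad(lambda a: L_tr(W_plus, a), A)
              - grad(lambda a: L_tr(W_minus, a), A)) / (2 * eps)
    return gA_val - xi * second

# Toy losses coupling W and A
L_tr = lambda w, a: (w - a) ** 2
L_val = lambda w, a: (w - 1.0) ** 2 + 0.1 * a ** 2
g = darts_arch_grad(L_tr, L_val, W=0.5, A=0.2)
```

For these quadratic losses the approximation agrees with the exact total derivative of the unrolled objective, which can be verified by differentiating A ↦ L_val(W − ξ ∇_W L_tr(W, A), A) numerically.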

The server holds the global versions of W and A. Each party k has local copies W_k and A_k, and also holds an auxiliary variable W′_k = W_k − ξ ∇_{W_k} L_tr(W_k, A_k). FNAS iteratively performs the following steps until convergence. (1) Each party uses W_k, A_k, and its training data to calculate ∇_{W_k} L_tr(W_k, A_k), and sends it to the server. (2) The server aggregates the gradients received from the different parties, performs a gradient descent update of the global W, and sends the updated global W to each party, which replaces its W_k with the new W. (3) Each party calculates the approximate architecture gradient in Eq. (5) and sends it to the server. (4) The server aggregates the architecture gradients received from the different parties, updates the global A, and sends the updated A to each party. (5) Each party replaces its A_k with the updated A and recomputes its auxiliary variable W′_k.
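The round structure above can be sketched as follows. This is a noise-free toy with scalar W and A; the `Party` gradient oracles are hypothetical stand-ins for gradients computed on real private data, and the second-order approximation of Eq. (5) is elided for brevity.

```python
class Party:
    """A party holding private data; gradient oracles are toy stand-ins
    for gradients of the real training/validation losses."""
    def __init__(self, xs):
        self.xs = xs  # private local data, never shared
    def grad_W(self, W, A):
        return sum(W - x for x in self.xs) / len(self.xs)
    def grad_A(self, W, A):
        return A  # toy architecture gradient (pulls A toward 0)

def fnas_round(server, parties, xi=0.01, eta=0.01):
    """One FNAS iteration, following steps (1)-(5): parties send gradients
    computed on local data; the server averages, updates the global W and A,
    and broadcasts them back."""
    W, A = server["W"], server["A"]
    gW = [p.grad_W(W, A) for p in parties]   # (1) parties -> server
    W -= xi * sum(gW) / len(gW)              # (2) server updates W, broadcasts
    gA = [p.grad_A(W, A) for p in parties]   # (3) parties -> server
    A -= eta * sum(gA) / len(gA)             # (4) server updates A, broadcasts
    server["W"], server["A"] = W, A          # (5) parties replace local copies
    return server

server = {"W": 0.0, "A": 1.0}
for _ in range(1000):
    fnas_round(server, [Party([1.0]), Party([3.0])])
```

Only gradients cross party boundaries; each party's raw examples stay in its `Party` object.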

In federated NAS, while the sensitive data of each party can be protected to some extent by avoiding sharing the data with other parties, there is still a significant risk of leaking privacy due to the sharing of intermediate sufficient statistics (e.g., gradients) among parties. It has been shown in several works that intermediate sufficient statistics can reveal private information if leveraged cleverly (bhowmick2018protection; carlini2018secret; fredrikson2015model). To address this problem, we study differentially-private (DP) FNAS, which uses DP techniques (dwork2006calibrating; dwork2008differential) to achieve a stronger preservation of privacy. A DP algorithm (with a parameter ε measuring the strength of privacy protection) guarantees that the log-likelihood ratio of the outputs of the algorithm under two databases differing in a single individual's data is smaller than ε. That is, regardless of whether the individual is present in the data, an adversary's inferences about this individual will be similar if ε is small enough; therefore, the privacy of this individual can be strongly protected. Several works have shown that adding random noise to the gradient can achieve differential privacy (rajkumar2012differentially; song2013stochastic; agarwal2018cpsgd). In this work, we follow the same strategy. For each party, the gradient updates of W and A are perturbed with random Gaussian noise before being sent to the server:

ĝ_W = g_W + n_W,     (6)
ĝ_A = g_A + n_A,     (7)

where the elements of n_W and n_A are drawn randomly from univariate Gaussian distributions with zero mean and variances σ_W² and σ_A² respectively. Algorithm 1 shows the execution workflow of one iteration of the differentially-private federated NAS (DP-FNAS) algorithm. Per-sample gradient clipping is used, with clipping bounds R_W and R_A.

for each party k do
   Take a Poisson subsample S_k ⊆ D_tr^(k) with subsampling probability B/m_k
   for each example i ∈ S_k do
      g_W^(i) ← ∇_W ℓ(W_k, A_k; x_i) · min(1, R_W / ‖∇_W ℓ(W_k, A_k; x_i)‖₂)   {Gradient clipping}
   end for
   ĝ_W^(k) ← Σ_{i ∈ S_k} g_W^(i) + n_W,  n_W ~ N(0, σ_W² I)   {Gaussian mechanism}
end for
On the server side: W ← W − ξ · (1/K) Σ_k ĝ_W^(k)
Send W to each party
for each party k do
   Take a Poisson subsample S′_k ⊆ D_val^(k) with subsampling probability B/v_k
   for each example i ∈ S′_k do
      g_A^(i) ← per-example architecture gradient of Eq. (5), clipped to norm R_A   {Gradient clipping}
   end for
   ĝ_A^(k) ← Σ_{i ∈ S′_k} g_A^(i) + n_A,  n_A ~ N(0, σ_A² I)   {Gaussian mechanism}
end for
On the server side: A ← A − η · (1/K) Σ_k ĝ_A^(k)
Send A to each party
for each party k do
   W_k ← W;  A_k ← A
end for
Algorithm 1: Execution semantics of one iteration of the DP-FNAS algorithm
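The per-sample clipping and Gaussian mechanism steps of Algorithm 1 can be sketched as follows. This is an illustrative, framework-free sketch: per-sample gradients are plain Python lists, and `random.gauss` stands in for the Gaussian noise source.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip(g, R):
    """Scale the per-sample gradient g so its L2 norm is at most R."""
    scale = max(1.0, l2_norm(g) / R)
    return [x / scale for x in g]

def gaussian_mechanism(per_sample_grads, R, sigma, rng=random):
    """Sum clipped per-sample gradients, then add zero-mean Gaussian noise
    with standard deviation sigma per coordinate, before the result is
    sent to the server."""
    clipped = [clip(g, R) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    return [s + rng.gauss(0.0, sigma) for s in summed]

grads = [[0.3, 0.4], [30.0, 40.0]]   # the second gradient gets clipped
noisy = gaussian_mechanism(grads, R=1.0, sigma=1.0)
```

Clipping bounds each individual's contribution to the sum by R in L2 norm, which is exactly the sensitivity bound that the privacy analysis in Section 3 relies on.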

3 Theoretical Analysis

In this section, we provide theoretical analysis of the differential privacy (DP) guarantees of the proposed DP-FNAS algorithm. We consider a recently proposed privacy definition, named f-DP (dong2019gaussian), owing to its tractable and lossless handling of privacy primitives such as composition and subsampling, and its superior accuracy results compared with (ε, δ)-DP (dong2019gaussian; bu2019deep). Broadly, composition is concerned with a sequence of analyses on the same dataset, where each analysis is informed by the explorations of prior analyses. Our proposed gradient-based FNAS algorithm involves two instances of private gradient sharing via the Gaussian mechanism (10.1561/0400000042) between the parties and the central server: one for optimizing the weight parameters and one for the architecture variables. The two mechanisms compose over each other within one iteration, and keep composing over further iterations of the algorithm. We provide a decoupled analysis of these two mechanisms over the iterations by leveraging the fact that the datasets used by the two mechanisms are disjoint (one operates on the training set, the other on the validation set). We state the results in terms of Gaussian differential privacy (the focal point of the f-DP guarantee family), which expresses privacy in a very interpretable manner by relating it to the hardness of telling apart two shifted normal distributions.

f-DP is a relaxation of (ε, δ)-DP recently proposed by (dong2019gaussian). This new privacy definition preserves the hypothesis testing interpretation of differential privacy. Moreover, it can efficiently analyze common primitives associated with differential privacy, including composition, privacy amplification by subsampling, and group privacy. In our proposed FNAS algorithm, mini-batch subsampling is used to improve computational efficiency. A side benefit of subsampling is that it naturally offers tighter privacy bounds, since an individual not contained in a subsampled mini-batch enjoys perfect privacy; f-DP leverages this fact to amplify privacy. In addition, f-DP includes a canonical single-parameter family referred to as Gaussian differential privacy (GDP). GDP is the focal privacy definition due to a central limit theorem stating that the privacy guarantee of a composition of private algorithms is approximately equivalent to telling apart two shifted normal distributions.

3.1 Preliminaries

An algorithm is considered private if the adversary finds it hard to determine the presence or absence of any individual in two neighbouring datasets. Two datasets S and S′ are said to be neighbors if one can be derived by discarding an individual from the other. The adversary seeks to tell apart the two probability distributions M(S) and M(S′), where M is the randomized mechanism, using a single draw. In light of this observation, it is natural to interpret what the adversary does as testing two simple hypotheses: H₀: the true dataset is S, versus H₁: the true dataset is S′. Intuitively, privacy is well guaranteed if this hypothesis testing problem is hard. Following this intuition, the definition of (ε, δ)-DP (dwork2008differential) essentially uses the worst-case likelihood ratio of the distributions associated with S and S′ to measure the hardness of testing the two simple hypotheses. f-DP uses a more informed measure of this hardness by operating directly with the trade-off function associated with the hypothesis test. Specifically, f-DP uses the trade-off between type I and type II errors in place of a few privacy parameters in (ε, δ)-DP or other divergence-based DP definitions. With this context, we move on to some formal definitions as stated in (dong2019gaussian), which we use in our proof.

Definition 3.1

(Trade-off Function) Let P and Q denote the distributions of M(S) and M(S′), respectively, and let φ be any (possibly randomized) rejection rule for testing H₀: P against H₁: Q. With these in place, the trade-off function of P and Q is defined as:

T(P, Q)(α) = inf_φ { 1 − E_Q[φ] : E_P[φ] ≤ α },

i.e., the smallest achievable type II error at type I error level α.

Definition 3.2

Let G_μ = T(N(0, 1), N(μ, 1)) for μ ≥ 0. A (randomized) algorithm M is μ-Gaussian differentially private (GDP) if T(M(S), M(S′)) ≥ G_μ for all neighboring datasets S and S′.

That is, μ-GDP says that determining whether any individual is in the dataset is at least as difficult as telling apart the two normal distributions N(0, 1) and N(μ, 1) based on one draw.
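The trade-off function G_μ has a closed form, G_μ(α) = Φ(Φ⁻¹(1 − α) − μ), where Φ is the standard normal CDF. A quick sketch using only the Python standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

def G(mu, alpha):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu): the smallest type II
    error achievable at type I error alpha when distinguishing
    N(0, 1) from N(mu, 1) with one draw."""
    return std_normal.cdf(std_normal.inv_cdf(1.0 - alpha) - mu)

# mu = 0: the two hypotheses coincide, so beta = 1 - alpha (blind guessing
# is optimal); larger mu makes the test easier, i.e. weaker privacy.
perfect = G(0.0, 0.05)
weaker = G(3.0, 0.05)
```

Plotting G(μ, ·) for several μ makes the interpretation of μ-GDP concrete: the lower the curve, the easier the adversary's hypothesis test.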

3.2 Privacy analysis

The major results are summarized in the following theorem.

Theorem 3.1

Consider a gradient-based federated NAS algorithm (Algorithm 1) which subsamples mini-batches (using Poisson subsampling), clips gradients, and perturbs the gradients of both the weight parameters and the architecture variables using the Gaussian mechanism at each iteration. Assuming that D_tr^(k) and D_val^(k) are disjoint for each party k, the algorithm achieves

(B/m_k) √(T (e^{1/σ_W²} − 1))-GDP for the weight-parameter updates and (B/v_k) √(T (e^{1/σ_A²} − 1))-GDP for the architecture-variable updates, with respect to the data of party k,

where GDP refers to Gaussian differential privacy, σ_W² and σ_A² represent the variances of the added Gaussian noises n_W and n_A respectively, T is the number of iterations, B is the mini-batch size, and m_k and v_k are the numbers of training and validation examples owned by party k, respectively.



  • Intuitively, these privacy bounds reveal that the algorithm gives good privacy guarantees if the subsampling probabilities B/m_k and B/v_k are small, and σ_W or σ_A are not too small.

  • Since GDP is achieved through a central limit theorem applied to the composition of distributions over iterations, it is expected that T is large enough. This requirement is usually satisfied under general settings of the DP-FNAS training procedure.

  • We can also choose different subsampling probabilities for the two processes, which will be reflected accordingly in the privacy bounds (via B/m_k and B/v_k). We may also use other subsampling methods such as shuffling (randomly permuting and dividing the data into folds in each epoch) and uniform sampling (sampling a batch of size B from the whole dataset at each iteration), which result in slightly different privacy bounds.

  • The utilization of subsampling in the proof tightens the privacy bound and is also closer to actual experimental settings. This tighter guarantee leaves some room to reduce the variance of the added Gaussian noise, which weakens privacy (as noted in the first remark) but improves the accuracy to which the model converges (since the noise variance is a major factor sacrificing accuracy in private optimization algorithms).

Please refer to the appendix for the proof.
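As an illustration, the GDP parameter from the bound of Theorem 3.1 can be evaluated numerically. The formula follows the subsampled-Gaussian composition result of (bu2019deep); the concrete numbers below are hypothetical, chosen only to exercise the formula.

```python
import math

def gdp_mu(B, n, T, sigma):
    """mu = (B/n) * sqrt(T * (exp(1/sigma^2) - 1)): the asymptotic GDP
    parameter of T compositions of the subsampled Gaussian mechanism with
    batch size B, dataset size n, and per-coordinate noise variance sigma^2
    (per bu2019deep)."""
    return (B / n) * math.sqrt(T * (math.exp(1.0 / sigma ** 2) - 1.0))

# Hypothetical setting: 25000 training examples at a party, batch size 64,
# 1000 search iterations, noise variance 1 (as in the experiments).
mu = gdp_mu(B=64, n=25000, T=1000, sigma=1.0)
# More noise or fewer iterations tightens the guarantee (smaller mu).
```

Such a calculation makes the remarks above quantitative: halving the batch size halves μ, while increasing σ shrinks the e^{1/σ²} − 1 factor.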

4 Experiments

In this section, we present experimental results on the CIFAR-10 dataset. The task is image classification. Our goal is to search for a highly-performant neural architecture for this task. Following (liu2018darts), we first search for an architecture cell by maximizing the validation performance. Given the searched cell, we perform augmentation: the cell is used to compose a larger architecture, which is then trained from scratch and evaluated on the test set.

4.1 Experimental Setup

The search space is the same as that in (liu2018darts). The candidate operations include: 3×3 and 5×5 separable convolutions, 3×3 and 5×5 dilated separable convolutions, 3×3 max pooling, 3×3 average pooling, identity, and zero. The network is a stack of multiple cells, each consisting of 7 nodes. The CIFAR-10 dataset has 60000 images from 10 classes: 50000 for training and 10000 for testing. During architecture search, we used 25000 images of the training set for validation. During augmentation, all 50000 images in the training set were used for training the composed architecture. The variances σ_W² and σ_A² of the noises added to the gradient updates of W and A were both set to 1. The clipping bounds R_W and R_A were set to 0.01 and 0.1 respectively. We experiment with the following settings:


  • NAS with a single party. Vanilla NAS is performed by a single party which has access to all training and validation data.

  • Federated NAS with K parties, where K ∈ {2, 4, 8}. The training data is randomly split into K partitions, each held by one party, and likewise for the validation data. The final architecture is evaluated on the test dataset, which is accessible to the server. The gradients calculated by each party are not obfuscated with random noise.

  • Differentially-private FNAS with K parties, where K ∈ {1, 2, 4, 8}. The gradients calculated by each party are obfuscated with random noise. The rest of the settings are the same as those in FNAS.

4.2 Results

             #parties  Test error (%)  Params (M)  Search cost (GPU days)  #ops
Vanilla NAS  1         2.8 ± 0.10      3.36        1.25                    4
FNAS         2         2.9 ± 0.15      3.36        1.21                    4
             4         3.2 ± 0.34      3.36        0.67                    4
             8         3.3 ± 0.40      3.36        0.55                    4
DP-FNAS      1         3.0 ± 0.10      3.36        1.39                    4
             2         3.0 ± 0.12      3.36        1.28                    4
             4         3.1 ± 0.13      3.36        0.93                    4
             8         3.4 ± 0.38      3.36        0.59                    4
Table 1: Test error under different settings. Note that the search cost covers only architecture search, not the augmentation stage which trains the composed architecture from scratch.

Variance of noise   Validation error (%)
0.5                 14.0 ± 0.32
1.0                 14.0 ± 0.32
2.0                 14.4 ± 0.43
5.0                 15.1 ± 0.85
8.0                 16.4 ± 1.01
10.0                19.2 ± 3.27
Table 2: Validation error achieved by DP-FNAS under different noise variances. The number of parties is 4.

Table 1 shows the test error and search cost (measured in GPU days) under different settings. From this table, we make the following observations. First, the performance of DP-FNAS with different numbers of parties is on par with that of single-party vanilla NAS. This demonstrates that DP-FNAS is able to search highly-performant neural architectures that are as good as those searched on a single machine, while preserving the differential privacy of individual parties. Second, in DP-FNAS, as the number of parties increases, the performance drops slightly. This is probably because Gaussian noise is added to the gradient of each party; more parties result in more added noise, which hurts the convergence of the algorithm. Third, under the same number of parties, DP-FNAS works slightly worse than FNAS. This is because FNAS is noise-free while the gradients in DP-FNAS are obfuscated with noise. However, the performance difference is very small, which shows that DP-FNAS is able to provide stronger privacy protection without substantially degrading performance. Fourth, in FNAS, as the number of parties increases, the performance becomes slightly worse. The possible reason is that as the number of parties increases, the size of the data held by each party decreases. Accordingly, the gradient calculated by each party on its own data is biased toward the data of that party, and such bias degrades the quality of the model updates. Fifth, as the number of parties increases, the search cost decreases. This is not surprising, since more parties contribute more computing resources. However, the cost reduction is not linear in the number of parties, because communication between parties incurs latency. Sixth, under the same number of parties, DP-FNAS has a slightly larger search cost than FNAS, because adding noise renders the gradient updates less accurate, which slows down convergence. Seventh, the number of parameters and operations remains the same across different numbers of parties, with or without noise. This indicates that DP-FNAS and FNAS do not substantially alter the searched architectures compared with those found by a single machine.

Table 2 shows how the validation error of DP-FNAS with 4 parties varies with the variance of the noise. As can be seen, a larger variance results in a larger validation error. This is because noises with larger variance tend to have larger magnitude, which makes the gradient updates less accurate. However, a larger variance also implies a stronger level of differential privacy. By tuning the variance of the noise, we can explore a spectrum of trade-offs between the strength of privacy protection and classification accuracy.

5 Related Works

Federated NAS

There are several works independently conducted in parallel to ours on the topic of federated NAS. In (he2020fednas), each client locally performs neural architecture search. The architecture variables of different clients are synchronized to their average periodically. This approach has no convergence guarantees. In our work, different parties collaboratively search for a global architecture by exchanging gradients in each iteration, where the convergence is naturally guaranteed. In (zhu2020real), a federated algorithm is proposed to search neural architectures based on the evolutionary algorithm (EA), which is computationally heavy. In our work, a gradient-based search algorithm is used, which has lower computational cost. In (xu2020neural), the search algorithm is based on NetAdapt (yang2018netadapt), which adapts a pretrained model to a new hardware platform, where the performance of the searched architecture is limited to that of the pretrained model. In our work, the search is performed in a large search space rather than constrained by a human-designed architecture.

Federated Learning

Federated learning (FL) is a decentralized learning paradigm which enables multiple parties to collaboratively train a shared model by leveraging data from different parties while preserving privacy. Please refer to (li2019federated) for an extensive review. One key issue in FL is how to synchronize the different parameter copies among parties. One common approach is periodically setting different copies to their average (mcmahan2016communication), which however has no convergence guarantees. Client-server-based architectures guarantee convergence by exchanging gradients and models between servers and clients, but incur high communication overhead. Konečnỳ et al. (konevcny2016federated) proposed two ways to reduce communication costs: learning updates from a restricted space parametrized using a smaller number of variables and compressing updates using quantization, random rotations, and subsampling.

Neural Architecture Search

Neural architecture search (NAS) has achieved remarkable progress recently, which aims at searching for the optimal architecture of neural networks to achieve the best predictive performance. In general, there are three paradigms of methods in NAS: reinforcement learning (RL) approaches (zoph2016neural; pham2018efficient; zoph2018learning), evolutionary learning approaches (liu2017hierarchical; real2019regularized), and gradient-based approaches (cai2018proxylessnas; liu2018darts; xie2018snas). In RL-based approaches, a policy is learned to iteratively generate new architectures by maximizing a reward which is the accuracy on the validation set. Evolutionary learning approaches represent the architectures as individuals in a population. Individuals with high fitness scores (validation accuracy) have the privilege to generate offspring, which replaces individuals with low fitness scores. Gradient-based approaches adopt a network pruning strategy. On top of an over-parameterized network, the weights of connections between nodes are learned using gradient descent. Then weights close to zero are later pruned.

Differential Privacy

Rajkumar and Agarwal (rajkumar2012differentially) developed differentially-private machine learning algorithms in a distributed multi-party setting. A client-server architecture is used to aggregate gradients computed by individual parties and to synchronize the different parameter copies. The gradient calculated in each iteration by each party is perturbed with two sources of random noise: (1) party-dependent, iteration-independent noise; and (2) party-independent, iteration-dependent noise. Agarwal et al. (agarwal2018cpsgd) studied distributed stochastic gradient descent algorithms that are both computationally efficient and differentially private. In their algorithm, clients add their share of the noise to their gradients before transmission; aggregating the gradients at the server yields an estimate whose noise equals the sum of the noise added at each client. Geyer et al. (geyer2017differentially) proposed an algorithm for preserving differential privacy on the clients' side in federated optimization, by concealing clients' contributions during training and balancing the trade-off between privacy loss and model performance.

6 Conclusions and Future Works

In this paper, we study differentially private federated neural architecture search (DP-FNAS), where multiple parties collaboratively search for a highly-performing neural architecture by leveraging the data from different parties, with strong privacy guarantees. DP-FNAS performs distributed gradient-based optimization of architecture variables and weight parameters using a parameter server architecture. Gradient updates are obfuscated with random Gaussian noise to achieve differential privacy. We provide theoretical guarantees of DP-FNAS on privacy preservation. Experiments on varying numbers of parties demonstrate that our algorithm can search neural architectures which are as good as those searched on a single machine while preserving privacy of individual parties. For future works, we aim to reduce the communication cost in DP-FNAS, by developing methods such as gradient compression, periodic updates, diverse example selection, etc.

Appendix A Proof of Theorem 3.1

a.1 Proof sketch

Here we first present a proof sketch; for the detailed proof, please refer to Section A.2. Algorithm 1 in the main paper has two gradient-sharing steps: one for optimizing the weight parameters $W$, and the other for the architecture variables $A$. The gradient for $W$ is calculated on training data, while that for $A$ is calculated on validation data. These two steps in each iteration involve two randomized mechanisms, $M_W$ and $M_A$, which release perturbed gradients w.r.t. $W$ and $A$ respectively. We leverage the fact that the query functions of the two mechanisms operate on two datasets with disjoint data points: the training set contains no information about individuals in the validation set, and vice versa. This ties the privacy risk of any individual to only one of the two datasets. Moreover, composition concerns a sequence of analyses on the same dataset, where each analysis is informed by the outcomes of prior analyses; hence, composing these two mechanisms within each iteration does not affect the privacy bound of either. In that sense, the compositions (8) and (9) decouple as (10) and (11) respectively for any party $k$:

$$M_W^{(t)} = M_W^{(t)}\!\left(D_k^{\mathrm{tr}};\; M_W^{(1)}, \dots, M_W^{(t-1)}\right), \qquad M_A^{(t)} = M_A^{(t)}\!\left(D_k^{\mathrm{val}};\; M_A^{(1)}, \dots, M_A^{(t-1)}\right),$$

where $M_W^{(t)}$ represents the randomized mechanism for the gradient w.r.t. $W$ at iteration $t$ for party $k$; it takes the outputs of the previous mechanisms ($M_W^{(1)}$ through $M_W^{(t-1)}$) as inputs. Similarly, $M_A^{(t)}$ represents the randomized mechanism for the gradient w.r.t. $A$ at iteration $t$ for party $k$. The expression above makes the recursion explicit, as is also evident from Algorithm 1 in the main paper. With these in place, we can argue that the two mechanisms compose independently across iterations for each party. (Note that we ignore the presence of the validation set $D_k^{\mathrm{val}}$ in the same way we ignore the datasets of other parties, since in both scenarios the datasets are presumably disjoint from $D_k^{\mathrm{tr}}$.)

Note that adding or removing one individual changes the value of the gradient w.r.t. $W$ or $A$ (from Algorithm 1 in the main paper) by at most the corresponding clipping constant in $\ell_2$ norm, due to the clipping operation. Hence the query functions of the mechanisms $M_W$ and $M_A$ have sensitivity equal to their respective clipping constants; these constants play a major role in the accuracy achieved by the algorithm. We also subsample the dataset when computing gradients in both steps: we perform Poisson subsampling, including each data point in the mini-batch used for gradient computation independently with probability $p$. This yields subsampled randomized mechanisms $M_W \circ S_p$ and $M_A \circ S_p$, similar to those in (bu2019deep). The above analysis has translated our problem into two instances of the problem in (bu2019deep), which allows us to leverage the results from (bu2019deep) for each of these compositions independently, completing the proof of Theorem 1 in the main paper.
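The Poisson subsampling step used in this argument can be illustrated with a short Python sketch (the helper name `poisson_subsample` is ours): each point independently makes it into the mini-batch with probability $p$, so the batch size itself is random.

```python
import random

def poisson_subsample(dataset, p, rng):
    """Poisson subsampling: each point enters the mini-batch
    independently with probability p (so the batch size is random)."""
    return [x for x in dataset if rng.random() < p]

rng = random.Random(0)
batch = poisson_subsample(list(range(1000)), p=0.05, rng=rng)
```

With $n = 1000$ and $p = 0.05$, the expected batch size is $np = 50$; this randomized batch selection is exactly what the subsampling theorem below analyzes.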

a.2 Detailed proof

For two distributions $P$ and $Q$, writing $f = T(P,Q)$, the definition of the trade-off function says that $f(\alpha) = \inf\{\beta_\phi : \alpha_\phi \le \alpha\}$ is the minimum type II error among all tests $\phi$ at significance level $\alpha$ (here $\alpha_\phi$ and $\beta_\phi$ denote the type I and type II errors of the rejection rule $\phi$). Self-evidently, the larger the trade-off function, the more difficult the hypothesis testing problem (hence the more privacy). With this intuition we have the following privacy definition.

Definition A.1

A (randomized) algorithm $M$ is $f$-differentially private if

$$T\big(M(S),\, M(S')\big) \ge f$$

for all neighboring datasets $S$ and $S'$.

In this definition, the inequality holds pointwise for all $0 \le \alpha \le 1$, and we abuse notation by identifying $M(S)$ and $M(S')$ with their associated distributions. We have the following relation of $(\epsilon,\delta)$-DP with $f$-DP from (wasserman2008statistical).

Definition A.2

(wasserman2008statistical) $(\epsilon,\delta)$-DP is a special instance of $f$-DP, in the sense that an algorithm is $(\epsilon,\delta)$-DP iff it is $f_{\epsilon,\delta}$-DP with

$$f_{\epsilon,\delta}(\alpha) = \max\big\{0,\; 1-\delta-e^{\epsilon}\alpha,\; e^{-\epsilon}(1-\delta-\alpha)\big\}$$

for all $0 \le \alpha \le 1$.
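This trade-off function is easy to check numerically. The snippet below is an illustrative sketch (the name `f_eps_delta` is ours), implementing the standard formula $f_{\epsilon,\delta}(\alpha) = \max\{0,\, 1-\delta-e^{\epsilon}\alpha,\, e^{-\epsilon}(1-\delta-\alpha)\}$ from the $f$-DP literature.

```python
import math

def f_eps_delta(alpha, eps, delta):
    """f_{eps,delta}(alpha) = max{0, 1 - delta - e^eps * alpha,
    e^{-eps} * (1 - delta - alpha)}."""
    return max(0.0,
               1.0 - delta - math.exp(eps) * alpha,
               math.exp(-eps) * (1.0 - delta - alpha))
```

At $\epsilon = \delta = 0$ this reduces to $\mathrm{Id}(\alpha) = 1 - \alpha$ (perfect privacy), and $f_{\epsilon,\delta}(0) = 1 - \delta$, matching the intuition that $\delta$ bounds the probability of a catastrophic disclosure.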

Definition A.3

Consider privately releasing a univariate statistic $\theta(S)$. The Gaussian mechanism adds noise $\xi \sim \mathcal{N}\big(0,\, \mathrm{sens}(\theta)^2/\mu^2\big)$ to the statistic $\theta(S)$, which gives $\mu$-GDP, i.e., $G_\mu$-DP with $G_\mu(\alpha) = \Phi\big(\Phi^{-1}(1-\alpha) - \mu\big)$. Here the sensitivity of $\theta$ is defined as $\mathrm{sens}(\theta) = \sup_{S, S'} |\theta(S) - \theta(S')|$, where the supremum is over all neighboring datasets.
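A minimal sketch of this mechanism, with the noise scale calibrated as $\sigma = \mathrm{sens}(\theta)/\mu$ per the definition above (the function name is illustrative):

```python
import random

def gaussian_mechanism(value, sensitivity, mu, rng):
    """Release value + N(0, (sensitivity/mu)^2), which is mu-GDP
    when `sensitivity` bounds the statistic's change on neighbors."""
    sigma = sensitivity / mu
    return value + rng.gauss(0.0, sigma)

release = gaussian_mechanism(5.0, sensitivity=1.0, mu=0.5, rng=random.Random(0))
```

Smaller $\mu$ means more privacy and hence more noise: with sensitivity 1 and $\mu = 0.5$, the noise standard deviation is 2.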

Definition A.4

(Binary Function) Given trade-off functions $f = T(P, Q)$ and $g = T(P', Q')$, the binary function $\otimes$ is defined as $f \otimes g = T(P \times P',\; Q \times Q')$.

A central limit theorem phenomenon arises in the composition of many "very private" $f$-DP algorithms, in the following sense: trade-off functions with small privacy leakage accumulate to $G_\mu$ for some $\mu$ under composition. More formally:

Lemma A.1

($f$-DP composition theorem) Assuming each $f_i$ is very close to $\mathrm{Id}(\alpha) = 1 - \alpha$, which corresponds to perfect privacy, we have

$$f_1 \otimes f_2 \otimes \cdots \otimes f_T \approx G_\mu$$

when $T$ is very large, where $\otimes$ is the binary function of Definition A.4.

As an important fact, this privacy bound cannot be improved in general.
Under Poisson subsampling, each point of the dataset $S$ is included in the subsample independently with probability $p$. The resulting subsample is denoted by $S_p(S)$. Given any algorithm $M$, denote by $M \circ S_p$ the subsampled algorithm.

Lemma A.2

($f$-DP subsampling theorem) Let $M$ be $f$-DP, and write $f_p$ for $p f + (1-p)\,\mathrm{Id}$. Then the subsampled algorithm $M \circ S_p$ is $\min\{f_p,\, f_p^{-1}\}^{**}$-DP, where $**$ denotes the double (Fenchel) conjugate.

The privacy bound $\min\{f_p,\, f_p^{-1}\}^{**}$ is larger than $f$ (subsampling amplifies privacy) and cannot be improved in general.

Lemma A.3

According to the central limit theorem, when $p\sqrt{T} \to \nu$ for a constant $\nu$, then as $T \to \infty$,

$$\big(\min\{f_p,\, f_p^{-1}\}^{**}\big)^{\otimes T} \to G_\mu,$$

where $f = G_{1/\sigma}$ and $\mu = \nu\,\sqrt{e^{1/\sigma^2} - 1}$.
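The asymptotic privacy parameter in this lemma is straightforward to compute. The sketch below assumes the closed form $\mu = p\sqrt{T\,(e^{1/\sigma^2}-1)}$ as stated in (bu2019deep); the helper name `gdp_mu` is our own.

```python
import math

def gdp_mu(p, T, sigma):
    """Asymptotic GDP parameter from Lemma A.3:
    mu = p * sqrt(T * (e^{1/sigma^2} - 1))."""
    return p * math.sqrt(T * math.expm1(1.0 / sigma ** 2))
```

For example, with subsampling probability $p = 0.01$, $T = 10{,}000$ iterations, and noise multiplier $\sigma = 1$, this gives $\mu = \sqrt{e - 1} \approx 1.31$; increasing $\sigma$ shrinks $\mu$ (more privacy), while $\mu$ grows only as $\sqrt{T}$ with the number of iterations.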

Theorem A.1

Given an optimization algorithm with a general deep neural network loss function, consider a Gaussian mechanism with noise variance $\sigma^2 C^2$, where $C$ is the gradient clipping constant (and also the sensitivity of the mechanism's query function) and $\sigma^2 C^2$ is the variance of the noise random variable. The algorithm, together with Poisson subsampling (with probability $p$) for gradient computation at each iteration, composed over $T$ iterations, achieves the following privacy guarantee: it is asymptotically $\mu$-GDP with

$$\mu = p\,\sqrt{T\,\big(e^{1/\sigma^2} - 1\big)}.$$

The query function for the Gaussian mechanism is the gradient of a general neural network loss evaluated w.r.t. the model parameters to be optimized. The sensitivity of the query function is $C$, and the standard deviation of the added noise is $\sigma C$. According to Definition A.3, this ensures that the mechanism $M$ is $(1/\sigma)$-GDP. As per the arguments in the appendix of (bu2019deep) (using the composition and subsampling theorems for $f$-DP), $M \circ S_p$ composed over $T$ iterations is $\big(\min\{f_p,\, f_p^{-1}\}^{**}\big)^{\otimes T}$-DP with $f = G_{1/\sigma}$. Using Lemma A.3,

$$\big(\min\{f_p,\, f_p^{-1}\}^{**}\big)^{\otimes T} \to G_\mu, \qquad \mu = p\,\sqrt{T\,\big(e^{1/\sigma^2} - 1\big)}.$$

Hence the composition of the subsampled algorithm over $T$ iterations is $\mu$-GDP.
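To report the more familiar $(\epsilon, \delta)$ guarantee, a $\mu$-GDP bound can be converted via the primal-dual relation of the $f$-DP framework: a $\mu$-GDP algorithm is $(\epsilon, \delta(\epsilon))$-DP for every $\epsilon \ge 0$ with $\delta(\epsilon) = \Phi(-\epsilon/\mu + \mu/2) - e^{\epsilon}\,\Phi(-\epsilon/\mu - \mu/2)$. A standard-library sketch of this conversion (helper names ours):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def delta_for_eps(mu, eps):
    """(eps, delta) implied by mu-GDP:
    delta(eps) = Phi(-eps/mu + mu/2) - e^eps * Phi(-eps/mu - mu/2)."""
    return (normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0))
```

For a fixed $\mu$, $\delta(\epsilon)$ decreases as $\epsilon$ grows, tracing out the full family of $(\epsilon,\delta)$ guarantees implied by the single GDP parameter.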