Differentially Private Meta-Learning

by Jeffrey Li, et al.
Carnegie Mellon University

Parameter-transfer is a well-known and versatile approach for meta-learning, with applications including few-shot learning, federated learning, and reinforcement learning. However, parameter-transfer algorithms often require sharing models that have been trained on the samples from specific tasks, thus leaving the task-owners susceptible to breaches of privacy. We conduct the first formal study of privacy in this setting and formalize the notion of task-global differential privacy as a practical relaxation of more commonly studied threat models. We then propose a new differentially private algorithm for gradient-based parameter transfer that not only satisfies this privacy requirement but also retains provable transfer learning guarantees in convex settings. Empirically, we apply our analysis to the problem of federated learning with personalization and show that allowing the relaxation to task-global privacy from the more commonly studied notion of local privacy leads to dramatically increased performance in recurrent neural language modeling.



1 Introduction

The field of meta-learning offers promising directions for improving the performance and adaptability of machine learning methods. At a high level, the key assumption leveraged by these approaches is that the sharing of knowledge gained from individual learning tasks can help catalyze the learning of similar unseen tasks. However, the collaborative nature of this process, in which task-specific information must be sent to and used by a meta-learner, also introduces inherent data privacy risks.

In this work, we focus on a popular and flexible meta-learning approach, parameter transfer via gradient-based meta-learning (GBML). This set of methods, which includes well-known algorithms such as MAML [14] and Reptile [24], tries to learn a common initialization over a set of tasks such that a high-performance model can be learned in only a few gradient steps on new tasks. Notably, information flows constantly between training tasks and the meta-learner as learning progresses; to make iterative updates, the meta-learner obtains feedback on the current initialization by training task-specific models with it.

Meanwhile, in many settings it is crucial to ensure that sensitive information in each task-specific dataset stays private. Examples of this include learning models for next-word prediction on cell phone data [23], clinical predictions using hospital records [29], and fraud detectors for competing credit card companies [27]. In such cases, each data-owner can benefit from information learned from other tasks, but each also desires, or is legally required, to keep their raw data private. Thus, it is not sufficient to learn a well-performing initialization; it is equally imperative to ensure that a task’s sensitive information is not obtainable by anyone else.

While parameter transfer algorithms can move towards this goal by performing task-specific optimization locally, thus preventing direct access to private data, this provision is far from fail-safe in terms of privacy. A wealth of work has shown in the single-task setting that it is possible for an adversary with access only to the model to learn detailed information about the training set, such as the presence or absence of specific records [25] or the identities of sensitive features given other covariates [16]. Furthermore, Carlini et al. [8] showed that deep neural networks can effectively memorize user-unique training examples, which can be recovered even after only a single epoch of training. As such, in parameter-transfer methods, the meta-learner or any downstream participant can potentially recover data from a previous task.

However, despite these serious risks, privacy-preserving meta-learning has remained a largely unstudied problem. Our work aims to address this issue by applying differential privacy (DP) [13], a well-established definition of privacy with rich theoretical guarantees and consistent empirical success at preventing leakages of data [8, 16, 18]. Crucially, although there are various threat models and degrees of DP one could consider in the meta-learning setting (as we outline in Section 2), we balance the well-documented trade-off between privacy and model utility by formalizing and focusing on a setting that we call task-global DP. This setting provides a strong privacy guarantee for each task-owner: sharing with the meta-learner will not reliably reveal anything about specific training examples to any downstream agent. It also allows us to use the framework of Khodak et al. [20] to provide a DP GBML algorithm that enjoys provable learning guarantees in convex settings.

Finally, we show an application of our work by drawing connections to federated learning (FL). While standard methods for FL, such as FedAvg [22], have inspired many works also concerning DP in a multi-user setup [1, 12, 17, 23, 28], we are the first to consider task-global DP as a useful variation on standard DP settings. Moreover, these works fundamentally differ from ours in that they do not consider a task-based notion of learnability as they aim to learn a single global model (since by design they focus on the global federated learning problem). That being said, a federated setting involving per-user personalization [10, 26] is a natural meta-learning application.

More specifically, our main contributions are:

  1. We are the first to taxonomize the different notions of DP possible for meta-learning; in particular, we formalize a variant we call task-global DP and argue that it adds a useful option to commonly studied settings in terms of trading off privacy and accuracy.

  2. We propose the first DP GBML algorithm, which we construct to satisfy this privacy setting. Further, we show a straightforward extension for obtaining a group DP version of our setting to protect multiple samples simultaneously.

  3. While our privacy guarantees hold generally, we also prove learning-theoretic results in convex settings. Our learning guarantees scale with task-similarity, as measured by the closeness of the task-specific optimal parameters [11, 21].

  4. We show that our algorithm, along with its theoretical guarantees, naturally carries over to federated learning with personalization. Compared to previous notions of privacy considered in works for DP federated learning [1, 5, 17, 23, 28], we are, to the best of our knowledge, the first to simultaneously provide both privacy and learning guarantees.

  5. Empirically, we demonstrate that our proposed privacy setting allows for strong performance on non-convex federated language modeling tasks. We achieve close to the performance of non-private models and significantly improve upon the performance of models trained with local-DP guarantees, a previously studied notion that also provides protections against the meta-learner. Our setting reasonably relaxes this latter notion but can achieve roughly times the performance on a modified version of the Shakespeare dataset [7] and times the performance on a modified version of Wiki-3029 [2].

1.1 Related Work

DP Algorithms in Federated Learning Settings. Works most similar to ours focus on providing DP for federated learning. Specifically, Geyer et al. [17] and McMahan et al. [23] apply update clipping and the Gaussian Mechanism to achieve global DP federated learning algorithms for image classification and language modeling tasks, respectively. Their methods are shown to suffer only minor drops in accuracy compared to non-private training, but they do not consider protections against inferences made by the meta-learner. Alternatively, Bhowmick et al. [5] do achieve such protection by applying a theoretically rate-optimal local DP mechanism to the model updates that users send to the meta-learner. However, they sidestep hard minimax rates [12] by assuming that adversaries have limited side-information and by allowing for a large privacy budget. In this work, though we achieve a relaxation of the privacy of Bhowmick et al. [5], we do not restrict the adversary’s power. Finally, Truex et al. [28] do consider a setting that coincides with task-global DP, but they focus primarily on the added benefits of applying MPC (see below) rather than studying the merits of the setting in comparison to other potential settings. Although these approaches all study privacy through the lens of learning a single global model, many of them, as well as our proposed GBML algorithm, are naturally amenable to a federated learning setting with personalization.

Secure Multiparty Computation (MPC). MPC is a cryptographic technique that allows parties to calculate a function of their inputs while also maintaining the privacy of each individual party’s inputs [6]. In GBML, sets of model updates may come in a batch from multiple tasks, and hence MPC can securely aggregate the batch before it is seen by the meta-learner. Though MPC itself gives no DP guarantees, it prevents the meta-learner from directly accessing any one task’s updates and can thus be combined with DP to increase privacy. Analogues of this approach have been studied in the federated setting, e.g. by Agarwal et al. [1], who apply MPC in the same difficult setting as Bhowmick et al. [5], and Truex et al. [28], who apply MPC similarly to a setting analogous to ours. On the other hand, MPC also comes with additional practical challenges such as peer-to-peer communication costs, dropouts, and vulnerability to collaborating participants. As such, and given its applicability to multiple settings, including ours, we consider MPC to be an orthogonal direction.

2 Privacy in a Meta-Learning Context

In this section, we first formalize the meta-learning setting that we consider. We then describe the various threat models that arise in the GBML setup, before presenting the different DP notions that can be achieved. Finally, we highlight the specific model and type of DP that we analyze.

2.1 Parameter Transfer Meta-Learning

In parameter transfer meta-learning, we assume that there is a set of learning tasks $t = 1, \dots, T$, each with its corresponding disjoint training set $D_t$. Each $D_t$ contains $m_t$ training examples $\{z_{t,i}\}_{i=1}^{m_t}$, where each $z_{t,i}$ is a single record. The goal within each task is to learn a function parameterized by $\hat{\theta}_t$ that performs “well,” generally in the sense that it has low within-task population risk in the distributional setting. The meta-learner’s goal is to learn an initialization $\phi$ that leads to a well-performing $\hat{\theta}_t$ within-task. In GBML this is learned via an iterative process that alternates between the following two steps: (1) a within-task procedure where a batch of task-owners receives the current $\phi$ and each uses $\phi$ as an initialization for running a within-task optimization procedure, obtaining $\hat{\theta}_t$; (2) a meta-level procedure where the meta-learner receives these model updates $\hat{\theta}_t$ and aggregates them to determine an updated $\phi$.

Notably, since both sub-procedures only need to receive the output from the other, an overall GBML algorithm can modularly change each one. From a privacy standpoint, even if the within-task sub-procedure is always done locally, specific information about the training set $D_t$ is vulnerable to being inferred by anyone who receives the model update $\hat{\theta}_t$, namely the meta-learner. Similarly, the meta-level procedure can potentially reveal sensitive information about previously seen task-owners’ model updates $\hat{\theta}_t$ by revealing the initialization $\phi$ to its future recipients, thus leaving task-owners vulnerable to each other.
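The two alternating steps can be sketched on a toy problem. This is a minimal illustration rather than the paper's implementation: the scalar parameter, the squared-error task loss, the Reptile-style averaging rule, and the names `within_task`, `meta_update`, and `run_gbml` are all assumptions made for the example, and nothing here is private yet.

```python
import random


def within_task(phi, data, steps=5, lr=0.1):
    """Step (1): run a few gradient steps on the task loss, starting from phi.

    Toy task loss (an assumption): 0.5 * mean((theta - x)^2) over the task's data.
    """
    theta = phi
    for _ in range(steps):
        grad = sum(theta - x for x in data) / len(data)
        theta -= lr * grad
    return theta  # the model update sent to the meta-learner


def meta_update(phi, updates, meta_lr=0.5):
    """Step (2): move phi toward the average of the received model updates."""
    avg = sum(updates) / len(updates)
    return phi + meta_lr * (avg - phi)


def run_gbml(tasks, rounds=50, batch_size=4, seed=0):
    rng = random.Random(seed)
    phi = 0.0
    for _ in range(rounds):
        batch = rng.sample(tasks, batch_size)  # a batch of task-owners
        updates = [within_task(phi, data) for data in batch]
        phi = meta_update(phi, updates)
    return phi


# Toy problem: 16 tasks whose optima cluster around 3.0.
rng = random.Random(1)
centers = [rng.gauss(3.0, 0.5) for _ in range(16)]
tasks = [[rng.gauss(c, 0.1) for _ in range(20)] for c in centers]
phi = run_gbml(tasks)
print(f"learned initialization: {phi:.3f}")
```

Note that `within_task` returns a value computed directly from the raw task data, which is exactly the quantity an honest-but-curious recipient could try to invert.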

2.2 Threat Models for GBML

As in any privacy endeavor, before discussing particular mechanisms, a key specification must be made in terms of what threat model is being considered. In particular, it must be specified both (1) who the potential adversaries are and (2) what information needs to be protected.

Potential adversaries. For a single task-owner, adversaries may be either solely recipients of the initialization $\phi$ (i.e. other task-owners) or recipients of either $\phi$ or the task’s model update $\hat{\theta}_t$ (i.e. also the meta-learner). In the latter case, we consider only an honest-but-curious meta-learner, who does not deviate from the agreed-upon algorithm but may try to make inferences based on the information it receives. In both cases, the concern is not only the intentions of these other participants, but also their own security against access by malicious outsiders.

Data to be protected. A system can choose either to protect information contained in single records one-at-a-time or to protect entire datasets simultaneously. This distinction between record-level and task-level privacy can be practically important. Multiple records within a task’s training set $D_t$ may reveal the same secret (e.g., a cell-phone user has sent their SSN multiple times), or the entire distribution of records in $D_t$ could reveal sensitive information (e.g., a user has sent all messages in a foreign language). In these cases, record-level privacy may not be sufficient. However, given that privacy and utility are often at odds, we often seek the weakest notion of privacy needed in order to best preserve utility.

In related work, focus has primarily been placed on task-level protections. However, these works usually fall into two extremes, either obtaining strong learning but having to trust the meta-learner [23, 17] or trusting nobody but also obtaining low performance [5]. In response, we try to bridge the gap between these threat models by considering a model that makes a relaxation from task-level to record-level privacy but retains protections for each task-owner against all other parties. This relaxation can be reasonably justified in practical situations: while task-level guarantees are strictly stronger, they may also be unnecessary. In particular, record-level guarantees are likely to be sufficient whenever single records each pertain to different individuals. For example, for hospitals, what we care about is providing privacy to the individual patients and not aggregate hospital information. For cell-phones, if one can bound the number of texts that could contain the same sensitive information, then a straightforward extension of our setting and methods, which protects up to $k$ records simultaneously, could also be sufficient.

Figure 1: Summary of the privacy protections guaranteed by local and global DP at the different levels of the meta-learning problem (with our notion in blue). On the right, we show what each specification would mean in two practical federated scenarios: mobile users and hospital networks.

2.3 Differential Privacy (DP) in a Single-Task Setting

In terms of actually achieving privacy guarantees for machine learning, a de facto standard has been to apply DP, a provision which strongly limits what one can infer about the examples a given model was trained on. Assuming a training set $D = \{z_1, \dots, z_m\}$, two common types of DP are considered.

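Differential Privacy (Global DP). A randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if for all measurable sets $S$ of outputs and for all datasets $D, D'$ that differ by at most one element:

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta$$

If this holds for $D, D'$ differing by at most $k$ elements, then $(\varepsilon, \delta)$-group DP is achieved.

Local Differential Privacy. A randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-locally differentially private if for any two possible training examples $z, z'$ and measurable $S$:

$$\Pr[\mathcal{M}(z) \in S] \le e^{\varepsilon} \Pr[\mathcal{M}(z') \in S] + \delta$$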

Global DP guarantees that it will be hard to infer the presence of a specific record in the training set by observing the output of the mechanism $\mathcal{M}$. It assumes that a trusted aggregator running $\mathcal{M}$ gets to see the training set directly and then privatizes the final output (usually by adding noise throughout training). On the other hand, local DP assumes a stronger threat model in which the aggregator also cannot be trusted. Thus, a random mechanism must be applied individually on each example before the aggregator sees it. Local DP is a stronger guarantee, as being $(\varepsilon, \delta)$-locally DP implies being $(\varepsilon, \delta)$-global DP by invariance to post-processing [13], but it also generally results in worse model performance, since it suffers from provably hard minimax rates [12].
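To make the difference in trust assumptions concrete, the sketch below estimates the mean of a set of private bits in both ways. It is only an illustration: it uses the Laplace mechanism and randomized response with pure $\varepsilon$-DP ($\delta = 0$) rather than the Gaussian-noise mechanisms used later in this paper, it assumes neighboring datasets differ by replacing one record, and all function names are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def global_dp_mean(values, epsilon, rng):
    """Trusted aggregator: sees the raw values in [0, 1], releases a noisy mean.

    Replacing one record changes the mean by at most 1/n (the sensitivity).
    """
    n = len(values)
    sensitivity = 1.0 / n
    return sum(values) / n + laplace_noise(sensitivity / epsilon, rng)


def local_dp_report(bit, epsilon, rng):
    """Randomized response: each owner perturbs their own bit before sharing it."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep else 1 - bit


def local_dp_mean(bits, epsilon, rng):
    """Untrusted aggregator: only sees perturbed bits, then removes the known bias."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    reports = [local_dp_report(b, epsilon, rng) for b in bits]
    raw = sum(reports) / len(reports)
    return (raw - (1.0 - keep)) / (2.0 * keep - 1.0)


bits = [1] * 3000 + [0] * 7000  # true mean is 0.3
rng = random.Random(0)
global_estimate = global_dp_mean(bits, 1.0, rng)
local_estimate = local_dp_mean(bits, 1.0, rng)
print(f"global DP: {global_estimate:.4f}  local DP: {local_estimate:.4f}")
```

At the same $\varepsilon$, the local estimate carries noise from every individual report, while the global estimate adds a single noise draw scaled by $1/n$; this is the utility gap that the minimax results formalize.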

2.4 Differential Privacy for a GBML Setting

In meta-learning, there exists a hierarchy of agents and statistical queries, so global and local DP cannot be defined as simply. Here, both the meta-level sub-procedure and the within-task sub-procedure can be considered individual queries, and a DP algorithm can implement either to be DP. Further, for each query, the procedure may be altered to satisfy either local DP or global DP. Thus, there are four fundamental options that follow from standard DP definitions.

  1. Global DP: Releasing the initialization $\phi$ will at no point compromise information regarding any specific model update $\hat{\theta}_t$.

  2. Local DP: Additionally, each model update $\hat{\theta}_t$ is protected from being revealed to the meta-learner.

  3. Task-Global DP: Releasing the model update $\hat{\theta}_t$ will at no point compromise any specific record $z_{t,i}$.

  4. Task-Local DP: Additionally, each record $z_{t,i}$ is protected from being revealed to the task-owner.

To form analogies to single-task DP, the examples in the meta-level procedure are the model updates and the aggregator is the meta-learner. For the within-task procedure, the examples are actually the individual records and the aggregator is the task-owner. As such, (1) is implemented by the meta-learner, (2) and (3) are implemented by the task-owner, and (4) is implemented by record-owners.

By immunity to post-processing, the guarantees for (3) and (4) also automatically apply to the release of any future iteration of the initialization $\phi$, thus protecting against future task-owners as well. Meanwhile, though (1) and (2) by definition protect the identities of individual model updates $\hat{\theta}_t$, they actually mask the entire presence or absence of any task, thus satisfying a task-level threat model. Intuitively, not being able to infer anything about $\hat{\theta}_t$ implies that nothing can be inferred about the training set $D_t$ that was used to generate it.

As a consequence, we can thus directly compare versions of (2) and (3) since both are mechanisms implemented by task-owners. Indeed, as we prove in the Appendix, we have:

Remark 2.1. If a GBML algorithm achieves $(\varepsilon, \delta)$-local DP at the meta-level, it is also guaranteed to be $(\varepsilon, \delta)$-task-global DP.

The converse, on the other hand, is not generally true: while some task-global-DP mechanisms may result in a local-DP guarantee, the particular $(\varepsilon, \delta)$ will not necessarily carry over. Both ensure that each task-owner has a guarantee for releasing its model update $\hat{\theta}_t$, but achieving local DP implies a task-level guarantee at the within-task level, while global DP at a within-task level may only provide record-level guarantees.

| Previous Work | Notion of DP | Privacy for $\phi$ (meta-initialization) | Privacy for $\hat{\theta}_t$ (model updates) |
| --- | --- | --- | --- |
| McMahan et al. [23] | Global | Task-level | - |
| Geyer et al. [17] | Global | Task-level | - |
| Bhowmick et al. [5] | Local, Global | Task-level | Task-level |
| Agarwal et al. [1] | Local + MPC | Task-level | Task-level |
| Truex et al. [28] | Task-Global + MPC | Record-level | Record-level |
| Our work | Task-Global | Record-level | Record-level |
Table 1: Broad categorization of the DP settings considered by our work in meta-learning and notable past works in the federated setting.

2.5 Task-Global DP in Comparison to Previous Works

Using the terminology we introduce in Section 2.4, previous works for DP in federated settings can be categorized as in Table 1. While these works do not assume a multi-task setting, the terms global/local and task-global/task-local can still analogously refer to releasing the global model (done by the central server) and user-specific updates (done locally on users’ devices), respectively.

Geyer et al. [17] and McMahan et al. [23] both directly provide global DP boosted by sub-sampling and show how they can achieve performance very close to non-private training. However, this privacy guarantee may be fundamentally insufficient if there is reason to distrust the central server or the security of accessing its computations. Task-global DP, in comparison, provides only record-level protections, but it does protect against these additional potential adversaries.

In contrast, both Bhowmick et al. [5] and Agarwal et al. [1] provide some form of local DP, a strictly stronger setting than task-global DP. However, Bhowmick et al. [5] show that they need to concede a very large privacy budget in order to achieve reasonable performance. Agarwal et al. [1] also consider what is effectively a local DP guarantee but leverage MPC to reduce the amount of randomization needed per task-owner. As mentioned in Section 1.1, this comes with additional practical challenges and is somewhat of an orthogonal direction. Indeed, Truex et al. [28] consider applying MPC to improve what are inherently task-global DP mechanisms.

Lastly, we remark that local-within-task DP (task-local DP in the terminology of Section 2.4) has not previously been studied, as protecting individual data points from task-owners is unlikely to be a concern (e.g., cell phone users already own their text messages, and one would assume patients already trust their hospitals).

Overall, in contrast to past works, we note that we are the first to formalize and consider the advantages and trade-offs implicit in privatizing the within-task algorithm. Additionally, we show task-global DP is the first notion of DP that can be shown to enjoy any form of provable meta-learning guarantees (Section 3) and also that it empirically improves upon local DP in terms of utility (Section 4).

3 Differentially Private Parameter-Transfer

3.1 Algorithm

We now present our DP GBML method, which is written out in its online (regret) form in Algorithm 1. Here, we observe that both within-task optimization and meta-optimization are done using some form of gradient descent. The key difference between this algorithm and traditional GBML is that since task-learners must send back privatized model updates, each now applies a DP gradient descent procedure to learn its model update $\hat{\theta}_t$ when called. However, at meta-test time the task-learner will run a non-private descent algorithm to obtain the parameter used for inference, as this parameter does not need to be sent to the meta-learner. To obtain learning-theoretic guarantees, we use a variant of Algorithm 1 in which the DP algorithm is an SGD procedure [3, Algorithm 1] that adds a properly scaled Gaussian noise vector at each iteration. A stability result due to Bassily et al. [3] regarding the population loss of this algorithm’s output allows us to provide bounds on the transfer risk due to our meta-algorithm.

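Meta-learner picks first meta-initialization $\phi_1$.
for task $t = 1, \dots, T$ do
       Meta-learner sends meta-initialization $\phi_t$ to task $t$.
       Task-learner runs OGD starting from $\phi_t$ on its task losses, suffering the corresponding regret.
       Task-learner runs $(\varepsilon, \delta)$-DP descent algorithm on its task losses to get $\hat{\theta}_t$.
       Task-learner sends $\hat{\theta}_t$ to meta-learner.
       Meta-learner constructs loss $\ell_t$ from $\hat{\theta}_t$.
       Meta-learner picks meta-initialization $\phi_{t+1}$ using an OCO algorithm on $\ell_1, \dots, \ell_t$.
Algorithm 1 Online version of our $(\varepsilon, \delta)$-meta-private parameter-transfer algorithm.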
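A toy, scalar rendering of Algorithm 1 may help fix ideas. It is a sketch under several assumptions that are not taken from the listing: the task loss is a squared error, the meta-loss is the squared distance between the initialization and the received update, the OCO algorithm is OGD with step size 1/t, and the private step is a clipped, Gaussian-noised SGD whose noise scale `sigma` is a free parameter rather than one calibrated to a particular $(\varepsilon, \delta)$.

```python
import random


def clip(g, bound):
    """Clip a scalar gradient to [-bound, bound] (enforces a Lipschitz bound)."""
    return max(-bound, min(bound, g))


def ogd(phi, data, lr=0.1):
    """Non-private within-task OGD on losses 0.5 * (theta - x)^2; average iterate."""
    theta, iterates = phi, []
    for x in data:
        iterates.append(theta)  # play theta, then observe the loss at x
        theta -= lr * (theta - x)
    return sum(iterates) / len(iterates)


def noisy_sgd(phi, data, rng, steps=20, lr=0.1, clip_bound=1.0, sigma=0.5):
    """Stand-in for the DP descent step: clipped gradient plus Gaussian noise.

    sigma is NOT calibrated to any (epsilon, delta) here; that needs an accountant.
    """
    theta, iterates = phi, []
    for _ in range(steps):
        x = rng.choice(data)
        grad = clip(theta - x, clip_bound) + rng.gauss(0.0, sigma)
        theta -= lr * grad
        iterates.append(theta)
    return sum(iterates) / len(iterates)


def meta_learn(tasks, rng, phi=0.0):
    local_params = []
    for t, data in enumerate(tasks, start=1):
        local_params.append(ogd(phi, data))    # stays with the task-owner
        theta_hat = noisy_sgd(phi, data, rng)  # the only value sent out
        # OGD step of size 1/t on l_t(phi) = 0.5 * (theta_hat - phi)^2,
        # which makes phi the running mean of the received theta_hat values.
        phi -= (1.0 / t) * (phi - theta_hat)
    return phi, local_params


rng = random.Random(0)
tasks = [[rng.gauss(c, 0.1) for _ in range(25)]
         for c in [rng.gauss(2.0, 0.2) for _ in range(40)]]
phi, local_params = meta_learn(tasks, rng)
print(f"meta-initialization after {len(tasks)} tasks: {phi:.3f}")
```

The design point the sketch tries to show is the separation of the two within-task runs: the non-private parameter is used only locally, and the meta-learner's state depends on the data only through the noised update.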

3.2 Privacy Guarantees

We run a certified $(\varepsilon, \delta)$-DP version of SGD [3, Algorithm 1] within each task. This guarantees that the contribution of each task-owner, a model update $\hat{\theta}_t$ trained on their data, carries global DP guarantees with respect to the meta-learner. Additionally, since DP is preserved under post-processing, the release of any future calculation stemming from $\hat{\theta}_t$ also carries the same DP guarantee.

3.3 Learning Guarantees

Our learning result follows the setup of Baxter [4], who formalized the learning-to-learn (LTL) problem as using $T$ task-distributions sampled from some meta-distribution $\mathcal{Q}$, and $m$ samples from each of those tasks, to improve performance when a new task is sampled from $\mathcal{Q}$ and we draw $m$ samples from it. In the setting of parameter-transfer meta-learning we are learning functions parameterized by real-valued vectors $\theta$, so our goal will follow that of Denevi et al. [11] and Khodak et al. [21] in seeking bounds on the transfer risk (the distributional performance of a learned parameter on a new task from $\mathcal{Q}$) that improve with task similarity.

The specific task-similarity metric we consider is the average deviation of the risk-minimizing parameters of tasks sampled from the distribution $\mathcal{Q}$. This will be measured in terms of the following quantity: $V^2 = \min_{\phi} \mathbb{E}_{\mathcal{P} \sim \mathcal{Q}} \|\theta^*_{\mathcal{P}} - \phi\|_2^2$, for $\theta^*_{\mathcal{P}}$ a risk-minimizer of task-distribution $\mathcal{P}$. This quantity is roughly the variance of risk-minimizing task-parameters and is a standard quantifier of improvement due to meta-learning [11, 21]. For example, Denevi et al. [11] show excess transfer-risk guarantees of the form $O(V/\sqrt{m})$, plus a term that vanishes as the number of tasks grows, when $T$ tasks with $m$ samples are drawn from the distribution. This guarantee ensures that as we see more tasks our transfer risk becomes roughly $V/\sqrt{m}$, which if the tasks are similar, i.e. $V$ is small, implies that LTL improves over single-task learning.

In Algorithm 1, each user obtains a within-task parameter by running (non-private) OGD on a sequence of losses and averaging the iterates. The regret of this procedure, when averaged across the users, implies a bound on the expected excess transfer risk on a new task from $\mathcal{Q}$ when running OGD from a learned initialization [9]. Thus our goal is to bound this regret in terms of $V$; here we follow the Average Regret-Upper-Bound Analysis (ARUBA) framework of Khodak et al. [21] and treat the meta-update procedure itself as an online algorithm optimizing a bound on the performance measure (regret) of each within-task algorithm. As OGD’s regret depends on the squared distance of the optimal parameter from the initialization $\phi$, with no privacy concerns one could simply update $\phi$ using the optimal parameter to recover guarantees similar to those in Denevi et al. [11] and Khodak et al. [21].

However, this approach requires sending the optimal parameter to the meta-learner, which is not private; instead in Algorithm 1 we send $\hat{\theta}_t$, which is the output of noisy SGD. To apply ARUBA, we need an additional assumption, namely that the losses satisfy the following quadratic growth (QG) property: for some $\alpha > 0$,

$$\frac{\alpha}{2} \|\theta - \theta^*\|_2^2 \le \ell_{\mathcal{P}}(\theta) - \ell_{\mathcal{P}}(\theta^*) \qquad (1)$$

Here $\theta^*$ is the risk minimizer of $\ell_{\mathcal{P}}$, the expected loss under task-distribution $\mathcal{P}$. This assumption, which Khodak et al. [20] show is reasonable in settings such as logistic regression, amounts to a statistical non-degeneracy assumption on the parameter space: parameters far away from the risk-minimizer do not have low risk. Note that QG is significantly weaker than strong convexity, which previous work [15] has assumed to hold for task losses but which does not hold for applicable cases such as few-shot least-squares or logistic regression if the number of task-samples is smaller than the data-dimension.
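A short derivation illustrates why quadratic growth is the weaker of the two conditions. The symbols are illustrative and not taken from the paper: $f$ is a differentiable function over a convex set $\Theta$ with minimizer $\theta^*$, and quadratic growth with constant $\alpha$ is read as $\frac{\alpha}{2}\|\theta - \theta^*\|_2^2 \le f(\theta) - f(\theta^*)$. If $f$ is $\alpha$-strongly convex, then

$$
\begin{aligned}
f(\theta) &\ge f(\theta^*) + \langle \nabla f(\theta^*), \theta - \theta^* \rangle + \frac{\alpha}{2}\|\theta - \theta^*\|_2^2 \\
&\ge f(\theta^*) + \frac{\alpha}{2}\|\theta - \theta^*\|_2^2 ,
\end{aligned}
$$

where the second line uses first-order optimality, $\langle \nabla f(\theta^*), \theta - \theta^* \rangle \ge 0$ for all $\theta \in \Theta$. So strong convexity implies quadratic growth with the same constant.

The converse can fail. Consider a single-sample least-squares loss $f(\theta) = \frac{1}{2}(a^\top \theta - b)^2$ in dimension greater than one, which has a whole affine subspace of minimizers and is therefore not strongly convex. Assuming $\theta^*$ is taken to be the minimizer closest to $\theta$, that is $\theta^* = \theta - a\,(a^\top\theta - b)/\|a\|_2^2$, we get

$$
f(\theta) - f(\theta^*) = \frac{1}{2}(a^\top \theta - b)^2 = \frac{\|a\|_2^2}{2}\,\|\theta - \theta^*\|_2^2 ,
$$

so quadratic growth holds with $\alpha = \|a\|_2^2$ even though strong convexity does not.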

We are now able to state our main theoretical result, a proof of which is given in Appendix B. The result follows from a bound on the task-average regret (TAR) across all tasks of a simple online meta-learning procedure that treats the update sent by each task as an approximation of the optimal parameter in hindsight. Since this parameter determines regret on that task, by reducing the meta-update procedure to OCO on this sequence of functions in a manner similar to [20], we are able to show a task-similarity-dependent bound on the TAR. Following this, the statistical guarantee stems from a nested online-to-batch conversion, a standard procedure to convert low-regret online-learning algorithms to low-risk distribution-learning algorithms.

Theorem 3.1. Suppose is a distribution over task-distributions over -Lipschitz, -Lipschitz-smooth, 1-bounded convex loss functions over parameter space with diameter , and let each satisfy the quadratic growth property (1). Suppose the distribution of each task is sampled i.i.d. from and we run Algorithm 1 with the -DP procedure of Bassily et al. [3, Algorithm 1] to obtain as the average iterate for the meta-update step. Then if for and we have the following bound on the expected transfer risk when a new task is sampled from , samples are drawn i.i.d. from , and we run OGD with learning rate starting from and use the average of the resulting iterates as the learned parameter:

Here is any element of and the outer expectation is taken over and the randomness of the within-task DP mechanism. Note that this procedure is -DP.

Theorem 3.1 shows that one can usefully run a DP algorithm as the within-task method in meta-learning and still obtain improvement due to task-similarity. Specifically, the standard term of $1/\sqrt{m}$ is multiplied by the task-similarity $V$, which is small if the tasks are related via the closeness of their risk minimizers. Thus we can use meta-learning to improve within-task performance relative to single-task learning. We also obtain very fast convergence in the number of tasks. However, we do gain some terms due to the quadratic growth approximation and the privacy mechanism. Note that the assumptions that both the functions and their gradients are Lipschitz-continuous are standard and required by the noisy SGD procedure of Bassily et al. [3].

This theorem also gives us a relatively straightforward extension if the desire is to provide $(\varepsilon, \delta)$-group DP. Since any privacy mechanism that provides $(\varepsilon, \delta)$-DP also provides $(k\varepsilon, k e^{(k-1)\varepsilon}\delta)$-DP guarantees for groups of size $k$ [13], we immediately have the following corollary.

Corollary 3.1.

Under the same assumptions and setting as Theorem 3.1, achieving -group DP is possible with the same guarantee except replacing with .

For constant $k$, this allows us to enjoy the stronger guarantee while maintaining largely the same learning rates. This is a useful result given that in some settings, it may be desired to simultaneously protect small groups of size $k$, such as protecting entire families for hospital records.
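As a small worked example of the cost of group protection, the snippet below applies the standard group-privacy conversion, under which an $(\varepsilon, \delta)$-DP mechanism is $(k\varepsilon, k e^{(k-1)\varepsilon}\delta)$-DP for groups of $k$ records. The function name and the example budget are illustrative assumptions, not values from the paper.

```python
import math


def group_privacy(epsilon, delta, k):
    """Guarantee that an (epsilon, delta)-DP mechanism gives to groups of k records."""
    if k < 1:
        raise ValueError("group size k must be at least 1")
    return k * epsilon, k * math.exp((k - 1) * epsilon) * delta


# Hypothetical record-level budget, for illustration only.
for k in (1, 2, 4):
    eps_k, delta_k = group_privacy(0.5, 1e-5, k)
    print(f"k={k}: epsilon={eps_k:.2f}, delta={delta_k:.2e}")
```

The group-level $\varepsilon$ grows linearly in $k$ while $\delta$ grows exponentially, which is why the extension is most attractive for small, constant group sizes.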

4 Empirical Results

In this section, we present results that show it is possible to learn useful deep models in federated scenarios while still preserving task-global privacy. In particular, our focus is to evaluate the performance of models that have been optimized with a task-global DP algorithm in comparison to models trained non-privately and models trained with the more commonly studied local DP. To this end, we evaluate the performance of an LSTM RNN on language modeling tasks and apply a practical variant of Algorithm 1 that considers both tasks and within-task examples in batches instead of serially. To obtain within-task privacy, we alter the within-task algorithm to be the DP-SGD algorithm as implemented by TensorFlow Privacy (https://github.com/tensorflow/privacy), and to obtain local privacy we use a modification of [23] where each task separately applies a Gaussian Mechanism to its model updates before sending them to the meta-learner.


We train a next-word predictor for two federated datasets: (1) the Shakespeare dataset as preprocessed by [7], and (2) a dataset constructed from Wikipedia articles drawn from the Wiki-3029 dataset [2], where each article is used as a different task. For each dataset, we set a fixed number of tokens per task, discard tasks with fewer tokens than the specified number, and discard the excess samples from tasks with more. We set the number of tokens per task to 800 for Shakespeare and to 1,600 for Wikipedia, divide tokens into fixed-length sequences, and refer to these modified datasets as Shakespeare-800 and Wiki-1600.

Meta Learning Algorithm.

We study the performance of our method when applied to the batch version of Reptile [24] (which, in our setup, reduces to personalized Federated Averaging when the meta-learning rate is set to 1). We tune the task batch size for all methods; for the non-private baseline, we also tune the number of visits per client, since there is no privacy degradation to account for. Additionally, we implement an exponential decay on the meta-learning rate. We defer a full discussion of hyperparameter tuning to the Appendix.
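As a rough illustration of the batch Reptile meta-update, the sketch below moves the meta-initialization toward the mean of the task-adapted parameters. With a meta-learning rate of 1 the update returns that mean exactly, which is the Federated Averaging step. The function names and the specific form of the exponential decay schedule are assumptions.

```python
import numpy as np

def reptile_batch_update(meta_params, task_params, meta_lr):
    """Move the meta-initialization toward the mean of the task-adapted parameters."""
    mean_task_params = np.mean(np.asarray(task_params, dtype=float), axis=0)
    return meta_params + meta_lr * (mean_task_params - meta_params)

def decayed_meta_lr(initial_lr, decay_rate, round_idx):
    """Exponentially decayed meta-learning rate for a 0-indexed round (assumed schedule)."""
    return initial_lr * decay_rate ** round_idx
```

For example, `reptile_batch_update(phi, thetas, decayed_meta_lr(1.0, 0.99, r))` would apply the update for round `r` with a hypothetical decay rate of 0.99.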

Privacy Considerations.

For the task-global DP models, we set for each task in both Shakespeare-800 and Wiki-1600, and we implement it with the tools provided by TensorFlow Privacy. Although their mechanism differs from the one presented in Section 3, it still lets us explore task-global privacy in a realistic setting. We use the provided RDP accountant to keep track of our privacy budget, obtaining for both Shakespeare and Wikipedia. Finally, for both datasets, we make sure that all tasks and their samples are seen only once, as we cannot leverage any sub-sampling results if the meta-learner can directly see who is sending updates each round.
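To give a feel for how an RDP-based budget is computed when no sub-sampling amplification is available, here is a simplified calculation. It is not the TensorFlow Privacy accountant. It assumes every step is a Gaussian mechanism with unit L2 sensitivity, uses the standard Rényi DP bound of order times one over twice the squared noise multiplier per step, composes additively, and applies the standard conversion to an (epsilon, delta) guarantee. The function name and the grid of orders are illustrative.

```python
import math

def gaussian_rdp_epsilon(noise_multiplier, num_steps, delta, orders=None):
    """Epsilon for `num_steps` composed Gaussian mechanisms without subsampling."""
    if orders is None:
        orders = [1.0 + x / 10.0 for x in range(1, 100)] + list(range(11, 257))
    best = math.inf
    for alpha in orders:
        # RDP of one Gaussian mechanism at order alpha, composed over all steps.
        rdp = num_steps * alpha / (2.0 * noise_multiplier ** 2)
        # Standard conversion from (alpha, rdp)-RDP to (eps, delta)-DP.
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best
```

Under these assumptions, epsilon grows with the number of steps and shrinks as the noise multiplier increases, which is the trade-off between training iterations and noise that a fixed budget imposes.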

For the local-DP training, even though this notion of DP is stronger, we explore the same privacy budgets so as to obtain guarantees that are of the same confidence. Here, we run the DP-FedAvg algorithm from [23] with two key changes. First, we add Gaussian noise to each clipped set of model updates before returning them to the central server instead of after aggregation. Second, we iterate through tasks without replacement with a fixed batch size rather than sampling each task with independent probability in each new round. The first change gives us local DP, and the second is necessary since multiple visits to a single client result in significant degradation of the privacy guarantee and we want each client to end up with the same final privacy parameters.
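The two changes can be sketched as below. This is a minimal illustration rather than the DP-FedAvg implementation of [23]: the helper names are hypothetical, the whole update is clipped as one flat vector, and the last batch is allowed to be smaller than the others when the number of tasks is not a multiple of the batch size.

```python
import numpy as np

def privatize_update_locally(update, clip_norm, noise_multiplier, rng):
    """Clip a task's model update and add Gaussian noise before it leaves the task."""
    update = np.asarray(update, dtype=float)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

def iterate_tasks_without_replacement(task_ids, batch_size, rng):
    """Yield batches of task ids so that each task is visited exactly once."""
    order = rng.permutation(len(task_ids))
    for start in range(0, len(order), batch_size):
        yield [task_ids[i] for i in order[start:start + batch_size]]
```

Because the noise is added on the task side, the server only ever receives the noised vector, whereas noising after aggregation would require trusting the server with the raw updates.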


Figure 2 shows the performance of both the non-private and task-global private versions of Reptile [24] on the Shakespeare and Wikipedia datasets. As expected, in neither case does the private (noised) version reach the same accuracy as the non-private (noiseless) version of the algorithm. Nonetheless, the private version still comes within of the non-private accuracy for Shakespeare-800 and within for Wiki-1600. Meanwhile, achieving local meta-level DP results in only about of the non-private accuracy on both Shakespeare-800 and Wiki-1600.

In practice, these gaps could be adjusted by changing the privacy budget of the algorithm or, for a given privacy budget, by trading off more training iterations against larger noise multipliers.

Figure 2: Performance of different versions of Reptile on a next-word-prediction task for two federated datasets. We report the test accuracy on unseen tasks and repeat each experiment 10 times. Solid lines correspond to means, colored bands indicate 1 standard deviation, and dotted lines are for comparing final accuracies (privatized versions of the algorithm are trained with only one visit per client).

5 Conclusions

In this work, we have outlined and studied the issue of privacy in the context of meta-learning. Focusing on the class of gradient-based parameter-transfer methods, we used differential privacy to address the privacy risks posed to task-owners by sharing task-specific models with a central meta-learner. To do so, we formalized and considered the notion of task-global differential privacy, which guarantees that individual examples from the tasks are protected from all downstream agents (and particularly the meta-learner). Working in this privacy model, we developed a differentially private algorithm that guarantees both this strong protection as well as learning-theoretic results in the convex setting. Finally, we demonstrated how this notion of privacy can translate into useful deep learning models for federated tasks.


This work was supported in part by DARPA FA875017C0141, the National Science Foundation grants IIS1705121 and IIS1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, a JP Morgan A.I. Research Faculty Award, and a Carnegie Bosch Institute Research Award. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.


Appendix A Local-Meta-Level DP and task-global DP

Remark A.1.

If a GBML algorithm achieves -local DP at the meta-level, it is also guaranteed to be -DP at a task-global level.


According to the definition of local DP, a mechanism that achieves -local DP for releasing must satisfy for any and :

Here can also be seen as a function, possibly stochastic, of , or more formally, where is an initialization and . Thus, by also setting , we automatically get for any

This holds by definition when is deterministic since and are single elements from . When and are stochastic, this bound also holds since it holds even in the worst case for any single pair of elements in . Further, the bound holds no matter how many elements differ between and , as long as outputs something in . Thus, if we treat as one mechanism, we get the stated remark.
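The argument can be summarized in two inequalities. The notation here is an assumption chosen for illustration: $\mathcal{M}$ is the randomized release mechanism, $\theta, \theta' \in \Theta$ are task-specific parameters, $g$ maps a task dataset and an initialization $\phi$ to such a parameter, and $D, D'$ are two task datasets.

```latex
% Local DP at the meta-level: the guarantee holds for ANY two inputs.
\Pr[\mathcal{M}(\theta) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(\theta') \in S] + \delta
\qquad \text{for all } \theta, \theta' \in \Theta,\; S \subseteq \Theta .

% Substituting \theta = g(D, \phi) and \theta' = g(D', \phi):
\Pr[\mathcal{M}(g(D, \phi)) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(g(D', \phi)) \in S] + \delta
\qquad \text{for all } D, D' .
```

Task-global DP only asks for the second inequality when $D$ and $D'$ differ in a single example, which is a special case of "all $D, D'$".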

Appendix B Proofs of Learning Guarantees

Throughout this section we assume all subsets are convex and in unless explicitly stated. In the online learning setting we will use the shorthand to denote the subgradient of evaluated at action . For any we will use to refer to the sum of the first of them.

In this section we first prove (Theorem B.1) a general averaged-regret bound following the ARUBA framework of Khodak et al. [21]. We then combine an algorithmic stability based -DP generalization bound for noisy SGD of Bassily et al. [3] with a quadratic growth assumption [19, 20] to show that such an algorithm returns a meta-update parameter that is close and thus suffices to show a meaningful task-averaged-regret guarantee (Corollary B.1). We conclude by using this bound to derive a guarantee in the statistical LTL setting (Corollary B.2).

Algorithm 2: Online version of our -meta-private parameter-transfer algorithm.
Meta-learner picks first meta-initialization .
for task  do
       Meta-learner sends meta-initialization to task .
       Task-learner runs OGD starting from on losses , suffering regret .
       Task-learner runs -DP descent algorithm on losses to get .
       Task-learner sends to meta-learner.
       Meta-learner constructs loss .
       Meta-learner picks meta-initialization using an OCO algorithm on .
end for

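The control flow of Algorithm 2 can be sketched in Python as follows. Several choices here are assumptions made only for illustration: the meta-loss is taken to be half the squared distance between the released parameter and the initialization, the meta-learner uses FTL (which for that loss is the running mean of the released parameters), and `noisy_descent` is a simple clipped-and-noised gradient method standing in for the within-task DP algorithm, with no specific privacy guarantee claimed for it.

```python
import numpy as np

def ogd(theta0, grad_fns, lr):
    """Online gradient descent; returns the iterate played against each loss."""
    theta, played = np.array(theta0, dtype=float), []
    for grad in grad_fns:
        played.append(theta.copy())
        theta = theta - lr * grad(theta)
    return played

def noisy_descent(theta0, grad_fns, lr, clip, sigma, rng):
    """Stand-in for the within-task DP method: clipped, noised steps; averaged iterate."""
    theta, iterates = np.array(theta0, dtype=float), []
    for grad in grad_fns:
        g = grad(theta)
        g = g * min(1.0, clip / max(np.linalg.norm(g), 1e-12))
        theta = theta - lr * (g + rng.normal(0.0, sigma * clip, size=g.shape))
        iterates.append(theta.copy())
    return np.mean(iterates, axis=0)

def online_meta_learning(tasks, phi0, lr, clip, sigma, rng):
    """tasks: a list of tasks, each given as a list of gradient functions."""
    phi = np.array(phi0, dtype=float)
    released = []
    for grad_fns in tasks:
        ogd(phi, grad_fns, lr)  # non-private within-task learning from phi
        theta_hat = noisy_descent(phi, grad_fns, lr, clip, sigma, rng)
        released.append(theta_hat)  # only this privatized vector leaves the task
        # FTL on the assumed losses 0.5 * ||theta_hat_s - phi||^2 is the running mean.
        phi = np.mean(released, axis=0)
    return phi, released
```

The point of the structure is that the meta-learner never touches task data or the OGD iterates; it only sees the sequence of privatized parameters.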
Setting B.1.

We assume all functions are convex and -Lipschitz for some and that has -diameter . We define the following quantities:

  • convenience coefficients

  • the sequence of update parameters with mean

  • a sequence of reference parameters with mean

  • a sequence of optimal parameters in hindsight

  • s.t.

  • s.t.

  • positive task-similarity

  • learning-rate for some

Theorem B.1.

In Setting B.1 define the regret upper-bound and the averaged regret upper-bound . Then in Algorithm 2 if the meta-learner uses FTL or AOGD to pick the meta-initialization and the within-task descent algorithm has regret upper-bounded by we have the following bound:

Here the expectation is taken over the randomness of the DP mechanism.


We apply the standard FTRL regret of OGD, e.g. Theorem A.1 in Khodak et al. [20], and the logarithmic regret of FTL and AOGD, e.g. Theorem A.2 in Khodak et al. [20]:

Setting B.2.

In Setting B.1, assume loss functions are generated by picking some distribution over valid losses and then sampling of them i.i.d. Assume further that the expected loss of every such distribution satisfies -quadratic-growth (-QG): for some , any , and the closest minimizer of to we have

Furthermore, assume that these losses are -strongly-smooth:

Finally, assume that is unique for every .
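For reference, the two conditions in this setting are commonly written as below. This is a textbook form with assumed symbols ($\alpha$ for the quadratic-growth constant, $\beta$ for the smoothness constant, $F$ for the expected loss, $\theta^\ast$ for the closest minimizer of $F$ to $\theta$, and $\ell$ for a single loss); the constants and notation used in the statements above may differ.

```latex
% Quadratic growth of the expected loss F around its closest minimizer:
\frac{\alpha}{2}\,\|\theta - \theta^\ast\|_2^2 \;\le\; F(\theta) - F(\theta^\ast)

% Strong smoothness of each loss \ell:
\ell(\theta) \;\le\; \ell(\theta') + \langle \nabla \ell(\theta'),\, \theta - \theta' \rangle
  + \frac{\beta}{2}\,\|\theta - \theta'\|_2^2
```

Quadratic growth is weaker than strong convexity: it lower-bounds the excess risk by the squared distance to the minimizer without requiring curvature everywhere, which is what lets a bound on excess risk be turned into a bound on parameter distance.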

Lemma B.1.

Let be a sequence of convex losses drawn i.i.d. from some distribution with risk being -QG and let be any of the optimal actions in hindsight. Then the closest minimum of to satisfies


Taking expectations of the result of Lemma B.4 in Khodak et al. [20], we have for that

Lemma B.2.

Let be a sequence of -strongly-smooth, -Lipschitz convex losses drawn i.i.d. from some distribution with risk being -QG and let be the average iterate of running Algorithm 1 of Bassily et al. [3] with the appropriate parameters for obtaining -DP. If then the closest minimum of to satisfies


The result follows by directly substituting Theorem 3.2 of Bassily et al. [3] into the definition of -QG:

Proposition B.1.

In Setting B.2 we have and


We apply the triangle inequality, Jensen’s inequality, and Lemmas B.1 and B.2 to get

We further have by the triangle inequality and Lemma B.2 that

Corollary B.1.

In Setting B.2, if we run Algorithm 2 using OGD with learning rate and Algorithm 1 of Bassily et al. [3] as the within-task -DP method then for we have the following bound on the expected task-averaged regret:


Substitute Proposition B.1 into Theorem B.1 and simplify. ∎

Corollary B.2.

In Setting B.2 and under the assumptions of Corollary B.1, if the distribution of each task is sampled i.i.d. from some environment and then we have the following bound on the expected transfer risk when a new task is sampled from , samples are drawn i.i.d. from , and we run OGD with starting from and use the average of the resulting iterates as the learned parameter:

Here is any element of and the outer expectation is taken over and the randomness of the DP mechanism.


The result follows from two applications of the standard in-expectation online-to-batch argument, e.g. Proposition A.1 of Khodak et al. [20], followed by an application of Corollary B.1:

Appendix C Experiment Details


We train a next-word predictor for two federated datasets: (1) the Shakespeare dataset as preprocessed by [7], and (2) a dataset constructed from Wikipedia articles, where each article is used as a different task. For each dataset, we set a fixed number of tokens per task, discard tasks with fewer tokens than this number, and discard the surplus samples from tasks with more. For Shakespeare, we set the number of tokens per task to 800, leaving tasks for meta-training, for meta-validation, and for meta-testing. For Wikipedia, we set the number of tokens to 1600, which corresponds to having tasks for meta-training, for meta-validation, and for meta-testing. For the meta-validation and meta-test tasks, of the tokens are used for local training, and the remaining for local testing.

Model Structure:

Our model first maps each token to an embedding of dimension before passing it through an LSTM of two layers of units each. The LSTM emits an output embedding, which is scored against all items of the vocabulary via dot product followed by a softmax. We build the vocabulary from the tokens in the meta-training set and fix its length to . We use a sequence length of for the LSTM and, as in [23], we evaluate using AccuracyTop1 (i.e., a prediction counts as correct only if the word assigned the highest probability matches the target) and consider all predictions of the unknown token as incorrect.
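The evaluation metric can be sketched as follows. The function name and the array layout (one row of vocabulary scores per predicted position) are assumptions; the rule that any prediction of the unknown token counts as incorrect follows the description above.

```python
import numpy as np

def top1_accuracy(logits, targets, unk_id):
    """Fraction of positions where the arg-max prediction equals the target and is not UNK."""
    logits = np.asarray(logits, dtype=float)   # shape (n, vocab_size)
    targets = np.asarray(targets)              # shape (n,)
    predictions = logits.argmax(axis=1)
    # A prediction of the unknown token is never counted as correct,
    # even when the target itself is the unknown token.
    correct = (predictions == targets) & (predictions != unk_id)
    return float(correct.mean())
```

Since softmax is monotone, taking the arg-max over raw scores gives the same prediction as taking it over probabilities.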


We tune the hyperparameters on the set of meta-validation tasks. For all datasets and all versions of the meta-learning algorithm, we tune hyperparameters in a two-step process. We first tune all the parameters that are not related to refinement: the meta-learning rate, the local (within-task) meta-training learning rate, the maximum gradient norm, and the decay constant. We then take the configuration with the best pre-refinement accuracy and tune the refinement parameters: the refine learning rate, refine batch size, and refine epochs.
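A minimal sketch of this two-step procedure is given below. It assumes an exhaustive grid search and two user-supplied scoring callables (hypothetical names `score_before_refine` and `score_after_refine`) that return meta-validation accuracy; the actual search strategy is not specified here.

```python
from itertools import product

def grid(space):
    """Yield every configuration (as a dict) in a dict-of-lists search space."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def two_stage_search(base_space, refine_space, score_before_refine, score_after_refine):
    """Tune non-refinement parameters first, then refinement parameters."""
    # Stage 1: best non-refinement configuration by pre-refinement accuracy.
    best_base = max(grid(base_space), key=score_before_refine)
    # Stage 2: with that configuration fixed, tune only the refinement parameters.
    best_refine = max(grid(refine_space),
                      key=lambda r: score_after_refine(best_base, r))
    return best_base, best_refine
```

Splitting the search this way evaluates the sum rather than the product of the two grid sizes, at the cost of possibly missing interactions between the two groups of parameters.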

All other hyperparameters are kept fixed for the sake of comparison: full batch steps were taken on within-task data, with the maximum number of microbatches used for the task-global DP model. The parameter search spaces are given in Tables 2, 3, and 4. In these tables, the final hyperparameters we used are in bold.

Hyperparameter Shakespeare-800 Wiki-1600
Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Refine LR
Refine Mini-batch Size
Refine Epochs
Table 2: Hyperparameter Search Space for Non-Private Training

Hyperparameter Shakespeare-800 Wiki-1600
Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Refine LR
Refine Mini-batch Size
Refine Epochs
Table 3: Hyperparameter Search Space for Task-Global DP Training

Hyperparameter Shakespeare-800 Wiki-1600
Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Refine LR
Refine Mini-batch Size
Refine Epochs
Table 4: Hyperparameter Search Space for Local-DP Training