1 Introduction
The field of meta-learning offers promising directions for improving the performance and adaptability of machine learning methods. At a high level, the key assumption leveraged by these approaches is that sharing knowledge gained from individual learning tasks can help catalyze the learning of similar unseen tasks. However, the collaborative nature of this process, in which task-specific information must be sent to and used by a meta-learner, also introduces inherent data privacy risks.

In this work, we focus on a popular and flexible meta-learning approach: parameter transfer via gradient-based meta-learning (GBML). This set of methods, which includes well-known algorithms such as MAML [14] and Reptile [24], tries to learn a common initialization over a set of tasks such that a high-performance model can be learned in only a few gradient steps on new tasks. Notably, information flows constantly between training tasks and the meta-learner as learning progresses; to make iterative updates, the meta-learner obtains feedback on the current initialization by training task-specific models with it.
Meanwhile, in many settings it is crucial to ensure that sensitive information in each task-specific dataset stays private. Examples include learning models for next-word prediction on cell-phone data [23], clinical predictions using hospital records [29], and fraud detectors for competing credit card companies [27]. In such cases, each data owner can benefit from information learned from other tasks, but each also desires, or is legally required, to keep their raw data private. Thus, it is not sufficient to learn a well-performing model; it is equally imperative to ensure that a task's sensitive information is not obtainable by anyone else.
While parameter-transfer algorithms can move towards this goal by performing task-specific optimization locally, thus preventing direct access to private data, this provision is far from fail-safe in terms of privacy. A wealth of work in the single-task setting has shown that an adversary with only access to the model can learn detailed information about the training set, such as the presence or absence of specific records [25] or the identities of sensitive features given other covariates [16]. Furthermore, Carlini et al. [8] showed that deep neural networks can effectively memorize user-unique training examples, which can be recovered even after only a single epoch of training. As such, in parameter-transfer methods, the meta-learner or any downstream participant can potentially recover data from a previous task.
However, despite these serious risks, privacy-preserving meta-learning has remained a largely unstudied problem. Our work aims to address this issue by applying differential privacy (DP) [13], a well-established definition of privacy with rich theoretical guarantees and consistent empirical success at preventing leakages of data [8, 16, 18]. Crucially, although there are various threat models and degrees of DP one could consider in the meta-learning setting (as we outline in Section 2), we balance the well-documented trade-off between privacy and model utility by formalizing and focusing on a setting that we call task-global DP. This setting provides a strong privacy guarantee for each task-owner: sharing its model updates with the meta-learner will not reliably reveal anything about specific training examples to any downstream agent. It also allows us to use the framework of Khodak et al. [20] to provide a DP GBML algorithm that enjoys provable learning guarantees in convex settings.
Finally, we show an application of our work by drawing connections to federated learning (FL). While standard methods for FL, such as FedAvg [22], have inspired many works also concerning DP in a multi-user setup [1, 12, 17, 23, 28], we are the first to consider task-global DP as a useful variation on standard DP settings. Moreover, these works fundamentally differ from ours in that they do not consider a task-based notion of learnability, as they aim to learn a single global model (since by design they focus on the global federated learning problem). That being said, a federated setting involving per-user personalization [10, 26] is a natural meta-learning application.
More specifically, our main contributions are:

We are the first to taxonomize the different notions of DP possible for meta-learning; in particular, we formalize a variant we call task-global DP, arguing that it adds a useful option to commonly studied settings in terms of trading off privacy and accuracy.

We propose the first DP GBML algorithm, which we construct to satisfy this privacy setting. Further, we show a straightforward extension for obtaining a group-DP version of our setting that protects multiple samples simultaneously.

We show that our algorithm, along with its theoretical guarantees, naturally carries over to federated learning with personalization. Compared to previous notions of privacy considered in works on DP federated learning [1, 5, 17, 23, 28], we are, to the best of our knowledge, the first to simultaneously provide both privacy and learning guarantees.

Empirically, we demonstrate that our proposed privacy setting allows for strong performance on non-convex federated language-modeling tasks. We achieve close to the performance of non-private models and significantly improve upon the performance of models trained with local-DP guarantees, a previously studied notion that also provides protections against the meta-learner. Our setting reasonably relaxes this latter notion but achieves a multiple of its performance on a modified version of the Shakespeare dataset [7] and on a modified version of Wiki3029 [2].
1.1 Related Work
DP Algorithms in Federated Learning Settings. Works most similar to ours focus on providing DP for federated learning. Specifically, Geyer et al. [17] and McMahan et al. [23] apply update clipping and the Gaussian mechanism to achieve global-DP federated learning algorithms for language-modeling and image-classification tasks, respectively. Their methods are shown to suffer only minor drops in accuracy compared to non-private training, but they do not consider protections against inferences made by the meta-learner. Alternatively, Bhowmick et al. [5] does achieve such protection by applying a theoretically rate-optimal local-DP mechanism on the updates users send to the meta-learner. However, they sidestep hard minimax rates [12] by assuming adversaries have limited side-information and by allowing for a large privacy budget. In this work, though we achieve a relaxation of the privacy of Bhowmick et al. [5], we do not restrict the adversary's power. Finally, Truex et al. [28] does consider a setting that coincides with task-global DP, but they focus primarily on the added benefits of applying MPC (see below) rather than studying the merits of the setting in comparison to other potential settings. Although these approaches all study privacy through the lens of learning a single global model, many of them, as well as our proposed GBML algorithm, are naturally amenable to a federated learning setting with personalization.
Secure Multiparty Computation (MPC). MPC is a cryptographic technique that allows parties to calculate a function of their inputs while maintaining the privacy of each individual party's inputs [6]. In GBML, sets of model updates may come in a batch from multiple tasks, and hence MPC can securely aggregate the batch before it is seen by the meta-learner. Though MPC itself gives no DP guarantees, it prevents the meta-learner from directly accessing any one task's updates and can thus be combined with DP to increase privacy. Analogues of this approach have been studied in the federated setting, e.g. by Agarwal et al. [1], who apply MPC in the same difficult setting as Bhowmick et al. [5], and Truex et al. [28], who apply MPC to a setting analogous to ours. On the other hand, MPC also comes with additional practical challenges such as peer-to-peer communication costs, dropouts, and vulnerability to colluding participants. As such, combined with its applicability to multiple settings, including ours, we consider MPC to be an orthogonal direction.
2 Privacy in a Meta-Learning Context
In this section, we first formalize the meta-learning setting that we consider. We then describe the various threat models that arise in the GBML setup, before presenting the different DP notions that can be achieved. Finally, we highlight the specific model and type of DP that we analyze.
2.1 Parameter-Transfer Meta-Learning
In parameter-transfer meta-learning, we assume that there is a set of learning tasks $t = 1, \dots, T$, each with its corresponding disjoint training set $D_t$. Each $D_t$ contains $m$ training examples. The goal within each task is to learn a function $f_{\theta_t}$ parameterized by $\theta_t$ that performs "well," generally in the sense that it has low within-task population risk in the distributional setting. The meta-learner's goal is to learn an initialization $\phi$ that leads to a well-performing $\theta_t$ within-task. In GBML this is learned via an iterative process that alternates between the following two steps: (1) a within-task procedure in which a batch of task-owners receives the current $\phi$ and each uses it as an initialization for running a within-task optimization procedure, obtaining $\hat{\theta}_t$; (2) a meta-level procedure in which the meta-learner receives these model updates and aggregates them to determine an updated $\phi$.
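The two-step loop above can be sketched as follows. This is a minimal, non-private, Reptile-style illustration on toy quadratic losses; the function names, learning rates, and toy tasks are our own and not from the paper:

```python
import numpy as np

def inner_sgd(phi, data, lr=0.1, steps=5):
    """Step (1): a task-owner starts from the current initialization phi
    and takes a few gradient steps on its own loss, returning theta_hat."""
    theta = phi.copy()
    for _ in range(steps):
        grad = 2 * (theta - data.mean(axis=0))  # toy quadratic loss
        theta -= lr * grad
    return theta

def meta_update(phi, theta_hats, meta_lr=0.5):
    """Step (2): the meta-learner aggregates the batch of task updates and
    moves the shared initialization toward their mean (Reptile-style)."""
    return phi + meta_lr * (np.mean(theta_hats, axis=0) - phi)

rng = np.random.default_rng(0)
phi = np.zeros(2)
task_centers = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]  # two similar tasks
for _ in range(50):
    updates = [inner_sgd(phi, c + 0.01 * rng.standard_normal((20, 2)))
               for c in task_centers]
    phi = meta_update(phi, updates)
# phi converges toward the mean of the task optima, roughly (1, 0)
```

Note that the privacy risk discussed next arises precisely because the raw `updates` are sent to the meta-learner at every round.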
Notably, since each subprocedure only needs to receive the output of the other, an overall GBML algorithm can modularly change either one. From a privacy standpoint, even if the within-task subprocedure is always done locally, specific information about $D_t$ is vulnerable to being inferred by anyone who receives $\hat{\theta}_t$, namely the meta-learner. Similarly, the meta-level procedure can potentially reveal sensitive information about previously seen task-owners' data to future recipients of $\phi$, thus leaving task-owners vulnerable to each other.
2.2 Threat Models for GBML
As in any privacy endeavor, before discussing particular mechanisms, a key specification must be made regarding the threat model being considered. In particular, one must specify both (1) who the potential adversaries are and (2) what information needs to be protected.
Potential adversaries. For a single task-owner, adversaries may be either solely recipients of $\phi$ (i.e. other task-owners) or recipients of either $\phi$ or $\hat{\theta}_t$ (i.e. also the meta-learner). In the latter case, we consider only an honest-but-curious meta-learner, who does not deviate from the agreed-upon algorithm but may try to make inferences based on the information it receives. In both cases, concern is placed not only on the intentions of these other participants, but also on their own security against access by malicious outsiders.
Data to be protected. A system can choose either to protect the information contained in single records one at a time or to protect entire datasets simultaneously. This distinction between record-level and task-level privacy can be practically important. Multiple records within $D_t$ may reveal the same secret (e.g., a cell-phone user has sent their SSN multiple times), or the entire distribution of $D_t$ could reveal sensitive information (e.g., a user has sent all messages in a foreign language). In these cases, record-level privacy may not be sufficient. However, given that privacy and utility are often at odds, we often seek the weakest notion of privacy needed in order to best preserve utility.
In related work, focus has primarily been placed on task-level protections. However, these works usually fall into two extremes: either obtaining strong learning but having to trust the meta-learner [23, 17], or trusting nobody but also obtaining low performance [5]. In response, we try to bridge the gap between these threat models by considering a model that relaxes task-level to record-level privacy but retains protections for each task-owner against all other parties. This relaxation can be reasonably justified in practical situations: while task-level guarantees are strictly stronger, they may also be unnecessary. In particular, record-level guarantees are likely to be sufficient whenever single records each pertain to different individuals. For example, for hospitals, what we care about is providing privacy to the individual patients, not aggregate hospital information. For cell phones, if one can bound by $k$ the number of texts that could contain the same sensitive information, then a straightforward extension of our setting and methods, which protects up to $k$ records simultaneously, could also be sufficient.
2.3 Differential Privacy (DP) in a Single-Task Setting
In terms of actually achieving privacy guarantees for machine learning, a de facto standard has been to apply DP, a provision that strongly limits what one can infer about the examples a given model was trained on. Assuming a training set $D$, two common types of DP are considered.
Differential Privacy (Global DP). A randomized mechanism $M$ is $(\varepsilon, \delta)$-differentially private if for all measurable sets $O$ and for all pairs of datasets $D, D'$ that differ in at most one element: $\Pr[M(D) \in O] \le e^{\varepsilon} \Pr[M(D') \in O] + \delta$. If this holds for $D, D'$ differing in at most $k$ elements, then $(\varepsilon, \delta)$ group DP for groups of size $k$ is achieved.
Local Differential Privacy. A randomized mechanism $M$ is $(\varepsilon, \delta)$-locally differentially private if for any two possible training examples $z, z'$ and any measurable set $O$: $\Pr[M(z) \in O] \le e^{\varepsilon} \Pr[M(z') \in O] + \delta$.
Global DP guarantees that it will be hard to infer the presence of a specific record in the training set by observing the output of $M$. It assumes a trusted aggregator running $M$ that gets to see $D$ directly and then privatizes the final output (usually by adding noise throughout training). Local DP, on the other hand, assumes a stronger threat model in which the aggregator also cannot be trusted. Thus, a random mechanism must be applied individually to each example before the aggregator sees it. Local DP is the stronger guarantee: being locally DP implies being globally DP by invariance to post-processing [13], but it also generally results in worse model performance, since it suffers from provably hard minimax rates (Duchi et al. [12]).
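The utility gap between the two threat models can be illustrated on the simplest possible query, releasing a mean. The sketch below is our own toy example, using a standard sufficient calibration for the Gaussian mechanism; it is not a mechanism from the paper:

```python
import numpy as np

def global_dp_mean(data, eps, delta, clip=1.0, seed=0):
    """Global DP: a trusted aggregator sees the raw data, computes the
    clipped mean, and adds noise once. Replacing one record moves the
    mean by at most 2*clip/n, so that is the sensitivity."""
    rng = np.random.default_rng(seed)
    clipped = np.clip(data, -clip, clip)
    sigma = (2 * clip / len(data)) * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return clipped.mean() + rng.normal(0, sigma)

def local_dp_mean(data, eps, delta, clip=1.0, seed=0):
    """Local DP: each record is noised before the aggregator sees it, so
    noise must be calibrated to the full per-record range of 2*clip."""
    rng = np.random.default_rng(seed)
    clipped = np.clip(data, -clip, clip)
    sigma = 2 * clip * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return (clipped + rng.normal(0, sigma, size=len(data))).mean()

data = np.random.default_rng(1).uniform(-1, 1, size=10_000)
g = global_dp_mean(data, eps=1.0, delta=1e-5)
l = local_dp_mean(data, eps=1.0, delta=1e-5)
# g is typically far closer to data.mean(): its noise scale is O(1/n),
# while the averaged per-record local noise only shrinks as O(1/sqrt(n))
```

This O(1/n) versus O(1/sqrt(n)) gap is the concrete face of the hard minimax rates for local DP mentioned above.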
2.4 Differential Privacy for a GBML Setting
In meta-learning there exists a hierarchy of agents and statistical queries, so we cannot define global and local DP as simply. Here, both the meta-level subprocedure and the within-task subprocedure can be considered individual queries, and a GBML algorithm can implement either to be DP. Further, for each query, the procedure may be altered to satisfy either local DP or global DP. Thus, there are four fundamental options that follow from standard DP definitions.

Global DP: Releasing $\phi$ will at no point compromise information regarding any specific $\hat{\theta}_t$.

Local DP: Additionally, each $\hat{\theta}_t$ is protected from being revealed to the meta-learner.

Task-Global DP: Releasing $\hat{\theta}_t$ will at no point compromise any specific record in $D_t$.

Task-Local DP: Additionally, each record is protected from being revealed to the task-owner.
To form analogies to single-task DP: in the meta-level procedure the examples are the model updates $\hat{\theta}_t$ and the aggregator is the meta-learner, whereas in the within-task procedure the examples are the individual records and the aggregator is the task-owner. As such, (1) is implemented by the meta-learner, (2) and (3) are implemented by the task-owner, and (4) is implemented by record-owners.
By immunity to post-processing, the guarantees for (3) and (4) also automatically apply to the release of any future iteration of $\phi$, thus protecting against future task-owners as well. Meanwhile, though (1) and (2) by definition protect the identities of individual updates $\hat{\theta}_t$, they actually mask the entire presence or absence of any task, thus satisfying a task-level threat model. Intuitively, not being able to infer anything about $\hat{\theta}_t$ implies that nothing can be inferred about the $D_t$ that was used to generate it.
As a consequence, we can directly compare versions of (2) and (3), since both are mechanisms implemented by task-owners. Indeed, as we prove in the Appendix, we have:
Remark 2.1.
If a GBML algorithm achieves local DP at the meta-level, it is also guaranteed to be task-global DP.
The converse, on the other hand, is not generally true: while some task-global-DP mechanisms may result in a local-DP guarantee, the particular privacy parameters will not necessarily carry over. Both notions ensure that each task-owner has a guarantee for releasing $\hat{\theta}_t$, but achieving local DP implies a task-level guarantee at the within-task level, whereas global DP at the within-task level may only provide record-level guarantees.
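The post-processing argument behind Remark 2.1 can be sketched as follows, writing $A$ for the within-task training map and $M$ for the meta-level local-DP mechanism (notation ours):

```latex
% Local DP at the meta-level bounds the likelihood ratio for ANY two inputs:
\Pr[M(x) \in O] \;\le\; e^{\varepsilon}\,\Pr[M(x') \in O] + \delta
  \qquad \forall\, x,\, x',\ \text{measurable } O.
% Instantiating x = A(D_t) and x' = A(D_t') for datasets differing in one record:
\Pr\!\big[M(A(D_t)) \in O\big] \;\le\; e^{\varepsilon}\,\Pr\!\big[M(A(D_t')) \in O\big] + \delta,
% which is exactly the task-global DP condition for releasing
% \hat{\theta}_t = M(A(D_t)).
```

The converse fails because task-global DP only constrains pairs of updates arising from datasets differing in one record, not arbitrary pairs of updates.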
Previous Work | Notion of DP | Privacy for $\phi$ | Privacy for $\hat{\theta}_t$
McMahan et al. [23] | Global | Task-level | —
Geyer et al. [17] | Global | Task-level | —
Bhowmick et al. [5] | Local, Global | Task-level | Task-level
Agarwal et al. [1] | Local + MPC | Task-level | Task-level
Truex et al. [28] | Task-Global + MPC | Record-level | Record-level
Our work | Task-Global | Record-level | Record-level

Table 1: Notions of DP provided by previous work and by our setting.
2.5 Task-Global DP in Comparison to Previous Works
Using the terminology we introduce in Section 2.4, previous works for DP in federated settings can be categorized as in Table 1. While these works do not assume a multi-task setting, the terms global/local and task-global/task-local can still analogously refer to releasing the global model (done by the central server) and user-specific updates (done locally on users' devices), respectively.
Geyer et al. [17] and McMahan et al. [23] both directly provide global DP boosted by subsampling and show that they can achieve performance very close to non-private training. However, this privacy guarantee may be fundamentally insufficient if there is reason to distrust the central server or the security of its computations. Task-global DP, in comparison, provides only record-level protections, but it does protect against these additional potential adversaries.
In contrast, both Bhowmick et al. [5] and Agarwal et al. [1] provide some form of local DP, a strictly stronger setting than task-global DP. However, Bhowmick et al. [5] show that they need to concede a very large privacy budget in order to achieve reasonable performance. Agarwal et al. [1] also consider what is effectively a local-DP guarantee but leverage MPC to reduce the amount of randomization needed per task-owner. As mentioned in Section 1.1, this comes with additional practical challenges and is somewhat of an orthogonal direction. Indeed, Truex et al. [28] consider applying MPC to improve what are inherently task-global DP mechanisms.
Lastly, we remark that task-local DP has not previously been studied, as protecting individual data points from their own task-owner is unlikely to be a concern (e.g., cell-phone users already own their text messages, and one would assume patients already trust their hospitals).
Overall, in contrast to past works, we are the first to formalize and consider the advantages and trade-offs implicit in privatizing the within-task algorithm. Additionally, we show that task-global DP is the first notion of DP for which any form of provable meta-learning guarantees can be given (Section 3), and also that it empirically improves upon local DP in terms of utility (Section 4).
3 Differentially Private Parameter-Transfer
3.1 Algorithm
We now present our DP GBML method, which is written out in its online (regret) form in Algorithm 1. Here, both within-task optimization and meta-optimization are done using some form of gradient descent. The key difference between this algorithm and traditional GBML is that, since task-learners must send back privatized model updates, each now applies an $(\varepsilon, \delta)$-DP gradient descent procedure to learn $\hat{\theta}_t$ when called. However, at meta-test time the task-learner runs a non-private descent algorithm to obtain the parameter used for inference, as this parameter does not need to be sent to the meta-learner. To obtain learning-theoretic guarantees, we use a variant of Algorithm 1 in which the DP algorithm is an SGD procedure [3, Algorithm 1] that adds a properly scaled Gaussian noise vector at each iteration. A stability result due to Bassily et al. [3] regarding the population loss of this algorithm's output allows us to provide bounds on the transfer risk of our meta-algorithm.

3.2 Privacy Guarantees
We run a certified DP version of SGD [3, Algorithm 1] within each task. This guarantees that the contribution of each task-owner, a model update $\hat{\theta}_t$ trained on their data, carries global DP guarantees with respect to the meta-learner. Additionally, since DP is preserved under post-processing, the release of any future calculation stemming from $\hat{\theta}_t$ also carries the same DP guarantee.
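The clip-and-noise pattern behind such a within-task mechanism can be sketched as follows. This is not the exact procedure of Bassily et al. [3]; it is a simplified full-batch variant with naive composition-based accounting, and all names, losses, and constants are our own illustration:

```python
import numpy as np

def dp_sgd(phi, examples, grad_fn, eps=1.0, delta=1e-5, clip=1.0,
           lr=0.1, epochs=50, seed=0):
    """Within-task noisy (full-batch) gradient descent: clip every
    per-example gradient to norm `clip`, average, add Gaussian noise
    calibrated to the average's sensitivity (2*clip/n) composed over
    `epochs` noisy releases, and return the privatized theta_hat."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(phi, dtype=float).copy()
    n = len(examples)
    sigma = (2 * clip / n) * np.sqrt(2 * epochs * np.log(1.25 / delta)) / eps
    for _ in range(epochs):
        grads = []
        for x in examples:
            g = grad_fn(theta, x)
            grads.append(g / max(1.0, np.linalg.norm(g) / clip))  # per-example clip
        theta -= lr * (np.mean(grads, axis=0) + rng.normal(0, sigma, theta.shape))
    return theta

# toy task: squared loss around points near (1, 1)
examples = np.random.default_rng(1).normal([1.0, 1.0], 0.1, size=(200, 2))
grad_fn = lambda theta, x: 2 * (theta - x)
theta_hat = dp_sgd(np.zeros(2), examples, grad_fn)
# theta_hat lands near the task optimum (1, 1) despite the added noise
```

The key point mirrored from the paper is that only the noised `theta_hat` ever leaves the task-owner, so post-processing by the meta-learner inherits the same guarantee.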
3.3 Learning Guarantees
Our learning result follows the setup of Baxter [4], who formalized the LTL (learning-to-learn) problem as using task-distribution samples from some meta-distribution $Q$, together with samples from those tasks, to improve performance when a new task is sampled from $Q$ and we draw samples from it. In the setting of parameter-transfer meta-learning we are learning functions parameterized by real-valued vectors $\theta$, so our goal follows that of Denevi et al. [11] and Khodak et al. [21] in seeking bounds on the transfer risk (the distributional performance of a learned parameter on a new task from $Q$) that improve with task similarity.
The specific task-similarity metric we consider is how close together the risk-minimizing parameters of tasks sampled from the distribution are. This is measured in terms of a quantity $V^2$, the average squared deviation of the risk minimizers $\theta^*_P$ of task-distributions $P$ sampled from $Q$. This quantity is roughly the variance of risk-minimizing task parameters and is a standard quantifier of the improvement due to meta-learning [11, 21]. For example, Denevi et al. [11] show excess transfer-risk guarantees that decay with the number of tasks $T$ when tasks with $m$ samples each are drawn from the distribution. Such a guarantee ensures that, as we see more tasks, our transfer risk becomes roughly proportional to $V$, which, if the tasks are similar, i.e. $V$ is small, implies that LTL improves over single-task learning.

In Algorithm 1, each user obtains a within-task parameter by running (non-private) OGD on a sequence of losses and averaging the iterates. The regret of this procedure, when averaged across the users, implies a bound on the expected excess transfer risk of a new task from $Q$ when running OGD from a learned initialization [9]. Thus our goal is to bound this regret in terms of $V$; here we follow the Average Regret-Upper-Bound Analysis (ARUBA) framework of Khodak et al. [21] and treat the meta-update procedure itself as an online algorithm optimizing a bound on the performance measure (regret) of each within-task algorithm. As OGD's regret depends on the squared distance of the optimal parameter $\theta^*_t$ from the initialization $\phi$, with no privacy concerns one could simply update $\phi$ using $\theta^*_t$ to recover guarantees similar to those in Denevi et al. [11] and Khodak et al. [21].
However, this approach requires sending $\theta^*_t$ to the meta-learner, which is not private; instead, in Algorithm 1 we send $\hat{\theta}_t$, the output of noisy SGD. To apply ARUBA, we need an additional assumption, namely that the task losses $\ell_t$ satisfy the following quadratic growth (QG) property: for some $\mu > 0$,

$\ell_t(\theta) - \ell_t(\theta^*_t) \ge \frac{\mu}{2} \|\theta - \theta^*_t\|^2.$   (1)

Here $\theta^*_t$ is the risk minimizer of $\ell_t$. This assumption, which Khodak et al. [20] show is reasonable in settings such as logistic regression, amounts to a statistical non-degeneracy assumption on the parameter space: parameters far away from the risk minimizer do not have low risk. Note that QG is significantly weaker than strong convexity, which previous work [15] has assumed to hold for task losses but which does not hold for applicable cases such as few-shot least-squares or logistic regression when the number of task samples is smaller than the data dimension.

We are now able to state our main theoretical result, a proof of which is given in Appendix B. The result follows from a bound on the task-averaged regret (TAR) across all tasks of a simple online meta-learning procedure that treats the update $\hat{\theta}_t$ sent by each task as an approximation of the optimal parameter in hindsight $\theta^*_t$. Since this parameter determines regret on that task, by reducing the meta-update procedure to OCO on this sequence of functions in a manner similar to [20], we are able to show a task-similarity-dependent bound on the TAR. The statistical guarantee then stems from a nested online-to-batch conversion, a standard procedure for converting low-regret online-learning algorithms into low-risk distribution-learning algorithms.
Theorem 3.1.
Suppose $Q$ is a distribution over task-distributions over Lipschitz, Lipschitz-smooth, 1-bounded convex loss functions over a parameter space of bounded diameter, and let each task satisfy the quadratic growth property (1). Suppose the distribution of each task is sampled i.i.d. from $Q$ and we run Algorithm 1 with the DP procedure of Bassily et al. [3, Algorithm 1] to obtain $\hat{\theta}_t$ as the average iterate for the meta-update step. Then, for an appropriate choice of algorithm parameters, we obtain a bound on the expected transfer risk when a new task is sampled from $Q$, $m$ samples are drawn i.i.d. from it, and we run OGD starting from the learned initialization $\phi$, using the average of the resulting iterates as the learned parameter; the dominant per-task term scales with the task similarity $V$, plus lower-order terms in the number of tasks and the privacy parameters. Here the outer expectation is taken over the sampling of tasks and the randomness of the within-task DP mechanism. Note that this procedure is $(\varepsilon, \delta)$-DP.
Theorem 3.1 shows that one can usefully run a DP algorithm as the within-task method in meta-learning and still obtain improvement due to task similarity. Specifically, the standard term in the number of within-task samples is multiplied by the task-similarity quantity $V$, which is small if the tasks are related via the closeness of their risk minimizers. Thus we can use meta-learning to improve within-task performance relative to single-task learning. We also obtain a very fast rate of convergence in the number of tasks. However, we do incur some additional terms due to the quadratic-growth approximation and the privacy mechanism. Note that the assumption that both the functions and their gradients are Lipschitz-continuous is standard and required by the noisy SGD procedure of Bassily et al. [3].
This theorem also admits a relatively straightforward extension if the desire is to provide group DP. Since any privacy mechanism that provides $(\varepsilon, \delta)$-DP also provides DP guarantees for groups of size $k$ [13], we immediately have the following corollary.
Corollary 3.1.
Under the same assumptions and setting as Theorem 3.1, achieving group DP for groups of size $k$ is possible with the same guarantee, except with $\varepsilon$ replaced by $\varepsilon/k$.
For constant $k$, this allows us to enjoy the stronger guarantee while maintaining largely the same learning rates. This is a useful result given that in some settings it may be desirable to simultaneously protect small groups of size $k$, such as entire families in hospital records.
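The cost of Corollary 3.1 can be seen directly in how Gaussian-mechanism noise scales with the group size. This is a hedged sketch using one standard sufficient calibration; the exact mechanism in Algorithm 1 may be calibrated differently:

```python
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """A common sufficient noise calibration for the Gaussian mechanism."""
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

# Record-level guarantee at budget eps:
sigma_record = gaussian_sigma(sensitivity=1.0, eps=1.0, delta=1e-5)

# Group guarantee for groups of size k, per Corollary 3.1: rerun the same
# machinery with eps/k, i.e. exactly k-times more noise.
k = 4
sigma_group = gaussian_sigma(sensitivity=1.0, eps=1.0 / k, delta=1e-5)
# sigma_group == k * sigma_record
```

For constant $k$ (e.g. protecting a family of 4 in hospital records), this constant-factor noise increase is the entire price of the stronger guarantee.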
4 Empirical Results
In this section, we present results showing that it is possible to learn useful deep models in federated scenarios while still preserving task-global privacy. In particular, our focus is to evaluate the performance of models that have been optimized with a task-global DP algorithm against both non-privately trained models and models trained with the previously more commonly studied local DP. To this end, we evaluate the performance of an LSTM RNN on language-modeling tasks and apply a practical variant of Algorithm 1 that processes both tasks and within-task examples in batches instead of serially. To obtain within-task privacy, we alter the within-task algorithm to be the DP-SGD algorithm as implemented by TensorFlow Privacy (https://github.com/tensorflow/privacy), and to obtain local privacy we use a modification of [23] in which each task separately applies a Gaussian mechanism on a single update before sending it to the meta-learner.
Datasets:
We train a next-word predictor on two federated datasets: (1) the Shakespeare dataset as preprocessed by [7], and (2) a dataset constructed from Wikipedia articles drawn from the Wiki3029 dataset [2], where each article is used as a different task. For each dataset, we fix a number of tokens per task, discard tasks with fewer tokens than specified, and discard samples from tasks with more. We set the number of tokens per task to 800 for Shakespeare and to 1600 for Wikipedia and divide tokens into sequences of fixed length; we refer to these modified datasets as Shakespeare-800 and Wiki-1600.
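The preprocessing just described can be sketched as follows (our own illustrative helper; only the per-task token counts, matching the dataset names, come from the text):

```python
def make_task_sequences(tokens, tokens_per_task, seq_len):
    """Preprocess one task's token stream: discard the task if it has too
    few tokens, truncate it if it has more, then split the remainder into
    fixed-length sequences. For Shakespeare-800, tokens_per_task is 800."""
    if len(tokens) < tokens_per_task:
        return None  # task discarded
    tokens = tokens[:tokens_per_task]
    return [tokens[i:i + seq_len] for i in range(0, tokens_per_task, seq_len)]

seqs = make_task_sequences(list(range(20)), tokens_per_task=12, seq_len=4)
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Fixing the token count per task keeps every task's privacy accounting identical, which matters for the per-task guarantees discussed below.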
Meta-Learning Algorithm.
We study the performance of our method when applied to the batch version of Reptile [24] (which, in our setup, reduces to personalized Federated Averaging when the meta-learning rate is set to 1). We tune various configurations of task batch size for all methods, and for the non-private baseline we also tune for multiple visits per client, since there is no privacy degradation to account for. Additionally, we implement an exponential decay on the meta-learning rate. We defer a full discussion of hyperparameter tuning to the Appendix.
Privacy Considerations.
For the task-global DP models, we set the same per-task privacy parameters for each task in both Shakespeare-800 and Wiki-1600, and we implement the mechanism with the tools provided by TensorFlow Privacy. Although their mechanism differs from the one presented in Section 3, it still lets us explore task-global privacy in a realistic setting. We use the RDP accountant provided in order to keep track of our privacy budget, obtaining the same final guarantee for both Shakespeare and Wikipedia. Finally, for both datasets, we make sure that all tasks and their samples are seen only once, as we cannot leverage any subsampling results if the meta-learner can directly see who is sending updates each round.
For the local-DP training, even though this notion of DP is stronger, we explore the same privacy budgets so as to obtain guarantees of the same confidence. Here, we run the DP-FedAvg algorithm from [23] with two key changes. First, we add Gaussian noise to each clipped set of model updates before returning them to the central server, instead of after aggregation. Second, we iterate through tasks without replacement with a fixed batch size, rather than sampling each task with independent probability in each round. The first change gives us local DP; the second is necessary since multiple visits to a single client result in significant degradation of the privacy guarantee, and we want each client to end up with the same final privacy parameters.
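The two modifications to DP-FedAvg can be sketched as follows (our own simplified rendering; clipping thresholds and noise scales are placeholder values, not the paper's):

```python
import numpy as np

def client_update_local_dp(update, clip, sigma, rng):
    """First change: the client clips its own model delta and adds Gaussian
    noise BEFORE sending, so the server never sees a raw update."""
    clipped = update / max(1.0, np.linalg.norm(update) / clip)
    return clipped + rng.normal(0, sigma, size=update.shape)

def server_round(client_updates, clip, sigma, rng):
    """The server only averages already-privatized updates. (The second
    change, visiting clients in fixed batches without replacement, is
    handled by whoever schedules the rounds.)"""
    return np.mean([client_update_local_dp(u, clip, sigma, rng)
                    for u in client_updates], axis=0)

rng = np.random.default_rng(0)
updates = [np.full(3, float(i)) for i in range(1, 5)]
avg = server_round(updates, clip=10.0, sigma=0.0, rng=rng)  # sigma=0 sanity check
# with sigma=0 and no clipping active, this reduces to plain FedAvg: mean 2.5
```

Because each client noises its own update, the per-client noise scale cannot be amortized over the batch, which is why local DP pays such a large utility cost relative to the task-global setting.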
Results.
Figure 2 shows the performance of both the non-private and task-global private versions of Reptile [24] on the Shakespeare and Wikipedia datasets. As expected, in neither case does the private (noised) version reach the same accuracy as the non-private (noiseless) version of the algorithm. Nonetheless, the private version still comes close to the non-private accuracy on both Shakespeare-800 and Wiki-1600. Meanwhile, achieving local DP at the meta-level yields only a fraction of the non-private accuracy on both Shakespeare-800 and Wiki-1600.
In practice, these differences could be adjusted by changing the privacy budget of the algorithm or, for a given privacy budget, by trading off more training iterations against larger noise multipliers.
Figure 2: Performance of different versions of Reptile on a next-word-prediction task for two federated datasets. We report the test accuracy on unseen tasks and repeat each experiment 10 times. Solid lines correspond to means, colored bands indicate one standard deviation, and dotted lines are for comparing final accuracies (privatized versions of the algorithm are trained with only one visit per client).
5 Conclusions
In this work, we have outlined and studied the issue of privacy in the context of meta-learning. Focusing on the class of gradient-based parameter-transfer methods, we used differential privacy to address the privacy risks posed to task-owners by sharing task-specific models with a central meta-learner. To do so, we formalized and considered the notion of task-global differential privacy, which guarantees that individual examples from the tasks are protected from all downstream agents (particularly the meta-learner). Working in this privacy model, we developed a differentially private algorithm that provides both this strong protection and learning-theoretic guarantees in the convex setting. Finally, we demonstrated how this notion of privacy can translate into useful deep learning models for federated tasks.
Acknowledgments
This work was supported in part by DARPA FA8750-17-C-0141, National Science Foundation grants IIS-1705121 and IIS-1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, a JP Morgan A.I. Research Faculty Award, and a Carnegie Bosch Institute Research Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.
References
Agarwal et al. [2018] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems 31, pages 7564–7575. Curran Associates, Inc., 2018.
 Arora et al. [2019] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Nikunj Saunshi, and Orestis Plevrakis. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
Bassily et al. [2019] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates, 2019. URL https://arxiv.org/abs/1908.09970.

Baxter [2000] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
Bhowmick et al. [2019] Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection against reconstruction and its applications in private federated learning, 2019. URL https://arxiv.org/abs/1812.00984.
 Bonawitz et al. [2017] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy preserving machine learning. Cryptology ePrint Archive, Report 2017/281, 2017. https://eprint.iacr.org/2017/281.
Caldas et al. [2018] Sebastian Caldas, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings, 2018. URL http://arxiv.org/abs/1812.01097.
 Carlini et al. [2018] Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets, 2018. URL http://arxiv.org/abs/1802.08232.
Cesa-Bianchi et al. [2004] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of online learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
Chen et al. [2018] Fei Chen, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. Federated meta-learning for recommendation. CoRR, abs/1802.07876, 2018. URL http://arxiv.org/abs/1802.07876.
Denevi et al. [2019] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization, 2019. URL http://arxiv.org/abs/1903.10399.

Duchi et al. [2018] John Duchi, Martin Wainwright, and Michael Jordan. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 2018.
Dwork and Roth [2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014. doi: 10.1561/0400000042.
Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Finn et al. [2019] Chelsea Finn, Aravind Rajeswaran, Sham M. Kakade, and Sergey Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
 Fredrikson et al. [2015] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333, 2015.
 Geyer et al. [2018] Robin C. Geyer, Tassilo J. Klein, and Moin Nabi. Differentially private federated learning: A client level perspective, 2018. URL https://openreview.net/forum?id=SkVRTj0cYQ.
 Jayaraman and Evans [2019] Bargav Jayaraman and David Evans. When relaxations go bad: "differentiallyprivate" machine learning, 2019. URL http://arxiv.org/abs/1902.08874.
Karimi et al. [2016] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.
Khodak et al. [2019a] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019a.
Khodak et al. [2019b] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, 2019b. To appear.
McMahan et al. [2017] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1273–1282, 2017.
 McMahan et al. [2018] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private language models. In ICLR, 2018.
Nichol et al. [2018] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018. URL http://arxiv.org/abs/1803.02999.
 Shokri et al. [2017] Reza Shokri, Marco Stronati, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Proceedings of 2017 IEEE Symposium on Security and Privacy, pages 3–18, 2017.
Smith et al. [2017] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems 30, 2017.
Stolfo et al. [1997] Salvatore J. Stolfo, David W. Fan, Wenke Lee, Andreas L. Prodromidis, and Philip K. Chan. Credit card fraud detection using meta-learning: Issues and initial results. In Working Notes of the AAAI Workshop on AI Approaches to Fraud Detection and Risk Management, 1997.
Truex et al. [2019] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, and Rui Zhang. A hybrid approach to privacy-preserving federated learning, 2019. URL http://arxiv.org/abs/1812.03224.
Zhang et al. [2019] Xi Sheryl Zhang, Fengyi Tang, Hiroko Dodge, Jiayu Zhou, and Fei Wang. MetaPred: Meta-learning for clinical risk prediction with limited patient electronic health records, 2019. URL https://arxiv.org/abs/1905.03218.
Appendix A Local Meta-Level DP and Task-Global DP
Remark A.1.
If a GBML algorithm achieves local DP at the meta-level, it is also guaranteed to be DP at a task-global level.
Proof.
According to the definition of local DP, a mechanism $\mathcal{M}$ that achieves $\varepsilon$-local DP for releasing model updates must satisfy, for any inputs $x, x'$ and any set of outputs $O$:
$$\Pr[\mathcal{M}(x) \in O] \le e^{\varepsilon} \Pr[\mathcal{M}(x') \in O].$$
Here $x$ can also be seen as a function, possibly stochastic, of the task dataset $S$, or more formally $x = f(S, \phi)$, where $\phi$ is an initialization and $f$ is the within-task training procedure. Thus, by also setting $x' = f(S', \phi)$ for any other dataset $S'$, we automatically get for any set of outputs $O$:
$$\Pr[\mathcal{M}(f(S, \phi)) \in O] \le e^{\varepsilon} \Pr[\mathcal{M}(f(S', \phi)) \in O].$$
This holds by definition when $f$ is deterministic, since $f(S, \phi)$ and $f(S', \phi)$ are single elements of the input space of $\mathcal{M}$. When $f(S, \phi)$ and $f(S', \phi)$ are stochastic, the bound also holds, since it holds even in the worst case for any single pair of elements in the input space. Further, the bound holds no matter how many elements differ between $S$ and $S'$, as long as $f$ outputs something in the input space of $\mathcal{M}$. Thus, if we treat $\mathcal{M} \circ f$ as one mechanism, we get the given proposition.
∎
Appendix B Proofs of Learning Guarantees
Throughout this section we assume all subsets are convex and lie in Euclidean space unless explicitly stated otherwise. In the online learning setting we use shorthand notation for the subgradient of a loss function evaluated at a given action, and for any sequence of quantities we use the corresponding shorthand for the sum of its first several elements.
In this section we first prove (Theorem B.1) a general averaged-regret bound following the ARUBA framework of Khodak et al. [21]. We then combine an algorithmic-stability-based DP generalization bound for noisy SGD of Bassily et al. [3] with a quadratic-growth assumption [19, 20] to show that such an algorithm returns a meta-update parameter that is sufficiently close to the optimum, which suffices to show a meaningful task-averaged-regret guarantee (Corollary B.1). We conclude by using this bound to derive a guarantee in the statistical LTL setting (Corollary B.2).
Setting B.1.
We assume all loss functions are convex and Lipschitz for some constant, and that the action space has bounded diameter. We define the following quantities:
- convenience coefficients;
- the sequence of update parameters, together with its mean;
- a sequence of reference parameters, together with its mean;
- a sequence of optimal parameters in hindsight;
- a positive task-similarity parameter;
- a learning rate, set in terms of the above constants.
Theorem B.1.
In Setting B.1, define the regret upper bound and the averaged-regret upper bound. Then, in Algorithm 2, if the meta-learner uses FTL or AOGD to pick the meta-initialization and the within-task descent algorithm has its regret upper-bounded accordingly, we have the following bound:
Here the expectation is taken over the randomness of the DP mechanism.
Proof.
Setting B.2.
In Setting B.1, assume the loss functions are generated by first picking some distribution over valid losses and then sampling a number of them i.i.d. Assume further that the expected loss $\bar\ell$ of every such distribution satisfies quadratic growth (QG): for some $\mu > 0$, any parameter $\theta$, and the minimizer $\theta^*$ of $\bar\ell$ closest to $\theta$, we have
$$\bar\ell(\theta) - \bar\ell(\theta^*) \ge \frac{\mu}{2} \lVert \theta - \theta^* \rVert_2^2.$$
Furthermore, assume that these losses are $\beta$-strongly-smooth: for all $\theta, \theta'$,
$$\ell(\theta') \le \ell(\theta) + \langle \nabla \ell(\theta), \theta' - \theta \rangle + \frac{\beta}{2} \lVert \theta' - \theta \rVert_2^2.$$
Finally, assume that the minimizer $\theta^*$ is unique for every such distribution.
Lemma B.1.
Let a sequence of convex losses be drawn i.i.d. from some distribution whose risk is QG, and let any of the optimal actions in hindsight be given. Then the minimum of the risk closest to that action satisfies
Proof.
Lemma B.2.
Let a sequence of strongly-smooth, Lipschitz convex losses be drawn i.i.d. from some distribution whose risk is QG, and let the average iterate of running Algorithm 1 of Bassily et al. [3] with the appropriate parameters for obtaining DP be given. If the sample size is sufficiently large, then the minimum of the risk closest to this iterate satisfies
Proof.
The result follows by directly substituting the bound of Theorem 3.2 of Bassily et al. [3] into the definition of QG.
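One way to sketch this substitution, writing $\hat\theta$ for the average iterate, $\theta^*$ for the closest risk minimizer, $\mu$ for the QG constant, and abbreviating the excess-risk bound of Theorem 3.2 of [3] as $\mathrm{err}_{\mathrm{DP}}$ (these symbols are ours and need not match the paper's notation):

```latex
\frac{\mu}{2}\,\lVert \hat\theta - \theta^* \rVert_2^2
  \;\le\; \bar\ell(\hat\theta) - \bar\ell(\theta^*)
  \;\le\; \mathrm{err}_{\mathrm{DP}},
\qquad\text{so}\qquad
\lVert \hat\theta - \theta^* \rVert_2
  \;\le\; \sqrt{2\,\mathrm{err}_{\mathrm{DP}} / \mu}.
```

The first inequality is exactly the QG condition, and the second is the excess-risk guarantee of the DP optimizer.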
∎
Proposition B.1.
In Setting B.2 we have the following two bounds:
Proof.
Corollary B.1.
Corollary B.2.
In Setting B.2 and under the assumptions of Corollary B.1, if the distribution of each task is sampled i.i.d. from some environment, then we have the following bound on the expected transfer risk when a new task is sampled from the environment, samples are drawn i.i.d. from it, and we run OGD starting from the learned initialization, using the average of the resulting iterates as the learned parameter:
Here the comparator is any element of the action space, and the outer expectation is taken over the task sampling and the randomness of the DP mechanism.
Appendix C Experiment Details
Datasets:
We train a next-word predictor for two federated datasets: (1) the Shakespeare dataset as preprocessed by [7], and (2) a dataset constructed from Wikipedia articles, where each article is used as a different task. For each dataset, we fix a number of tokens per task, discard tasks with fewer tokens than specified, and discard the surplus samples from tasks with more. For Shakespeare, we set the number of tokens per task to tokens, leaving tasks for meta-training, for meta-validation, and for meta-testing. For Wikipedia, we set the number of tokens to , which corresponds to having tasks for meta-training, for meta-validation, and for meta-testing. For the meta-validation and meta-test tasks, of the tokens are used for local training and the remaining for local testing.
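The per-task filtering and truncation described above can be sketched as follows; `build_tasks` and its interface are our own illustrative assumptions, not the preprocessing code of [7].

```python
def build_tasks(raw_tasks, tokens_per_task):
    """Keep only tasks with at least `tokens_per_task` tokens, truncating
    longer ones so every surviving task has exactly the same size."""
    tasks = []
    for tokens in raw_tasks:
        if len(tokens) < tokens_per_task:
            continue  # discard tasks that are too small
        tasks.append(tokens[:tokens_per_task])  # drop the surplus tokens
    return tasks
```

Fixing the per-task size this way keeps the privacy and learning guarantees uniform across tasks, since every task contributes the same number of examples.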
Model Structure:
Our model first maps each token to an embedding of dimension before passing it through an LSTM of two layers of units each. The LSTM emits an output embedding, which is scored against all items of the vocabulary via a dot product followed by a softmax. We build the vocabulary from the tokens in the meta-training set and fix its length to . We use a sequence length of for the LSTM and, as in [23], we evaluate using AccuracyTop1 (i.e., we only consider the predicted word to which the model assigned the highest probability) and count all predictions of the unknown token as incorrect.
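The evaluation metric can be made concrete with a short sketch; the function below is our own hypothetical implementation of AccuracyTop1 with unknown-token predictions counted as incorrect, and its name and interface are assumptions.

```python
def accuracy_top1(scores, targets, unk_id):
    """AccuracyTop1: only the single highest-scoring word counts as the
    prediction, and predicting the unknown token is always wrong."""
    correct = 0
    for row, target in zip(scores, targets):
        pred = max(range(len(row)), key=row.__getitem__)  # argmax over vocab
        if pred == target and pred != unk_id:
            correct += 1
    return correct / len(targets)
```

Counting unknown-token predictions as errors prevents a model from inflating its score by defaulting to the catch-all token on rare words.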
Hyperparameters:
We tune the hyperparameters on the set of meta-validation tasks. For all datasets and all versions of the meta-learning algorithm, we tune hyperparameters in a two-step process. We first tune all the parameters that are not related to refinement: the meta learning rate, the local (within-task) learning rate used during meta-training, the maximum gradient norm, and the decay constant. We then take the configuration with the best pre-refinement accuracy and tune the refinement parameters: the refinement learning rate, the refinement batch size, and the number of refinement epochs.
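The two-stage tuning procedure described above can be sketched as a generic grid search; `two_stage_tune` and the `evaluate` callback are our own illustrative assumptions, not the paper's tuning code.

```python
import itertools

def two_stage_tune(evaluate, base_grid, refine_grid):
    """Two-stage search: first tune the non-refinement hyperparameters by
    pre-refinement validation accuracy, then freeze the winner and tune the
    refinement hyperparameters on top of it. `evaluate(config)` is an
    assumed callback returning validation accuracy for a configuration."""
    def grid(space):
        keys = list(space)
        for vals in itertools.product(*(space[k] for k in keys)):
            yield dict(zip(keys, vals))

    # Stage 1: best configuration before any refinement parameters are set.
    best_base = max(grid(base_grid), key=evaluate)
    # Stage 2: tune refinement parameters with the base configuration fixed.
    return max(({**best_base, **r} for r in grid(refine_grid)), key=evaluate)
```

Splitting the search this way trades a possibly sub-optimal joint configuration for a much smaller number of evaluations than a full joint grid.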
All other hyperparameters are kept fixed for the sake of comparison: full-batch steps were taken on within-task data, with the maximum number of microbatches used for the task-global DP model. The parameter search spaces are given in Tables 2, 3, and 4; in these tables, the final hyperparameters we used are in bold.
Hyperparameter  Shakespeare  Wiki

Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Clipping
Refine LR
Refine Minibatch Size
Refine Epochs
Hyperparameter  Shakespeare  Wiki

Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Clipping
Refine LR
Refine Minibatch Size
Refine Epochs
Hyperparameter  Shakespeare  Wiki

Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Clipping
Refine LR
Refine Minibatch Size
Refine Epochs