1 Introduction
The field of meta-learning offers promising directions for improving the performance and adaptability of machine learning methods. At a high level, the key assumption leveraged by these approaches is that sharing knowledge gained from individual learning tasks can help catalyze the learning of similar unseen tasks. However, the collaborative nature of this process, in which task-specific information must be sent to and used by a meta-learner, also introduces inherent data privacy risks.

In this work, we focus on a popular and flexible meta-learning approach: parameter transfer via gradient-based meta-learning (GBML). This set of methods, which includes well-known algorithms such as MAML [14] and Reptile [24], tries to learn a common initialization over a set of tasks such that a high-performance model can be learned in only a few gradient steps on new tasks. Notably, information flows constantly between training tasks and the meta-learner as learning progresses; to make iterative updates, the meta-learner obtains feedback on the current initialization by training task-specific models with it.
Meanwhile, in many settings it is crucial to ensure that sensitive information in each task-specific dataset stays private. Examples include learning models for next-word prediction on cell-phone data [23], clinical predictions using hospital records [29], and fraud detectors for competing credit card companies [27]. In such cases, each data owner can benefit from information learned from other tasks, but each also desires, or is legally required, to keep their raw data private. Thus, it is not sufficient to learn a well-performing model; it is equally imperative to ensure that a task's sensitive information is not obtainable by anyone else.
While parameter-transfer algorithms can move towards this goal by performing task-specific optimization locally, thus preventing direct access to private data, this provision is far from fail-safe in terms of privacy. A wealth of work in the single-task setting has shown that an adversary with only access to the model can learn detailed information about the training set, such as the presence or absence of specific records [25] or the identities of sensitive features given other covariates [16]. Furthermore, Carlini et al. [8] showed that deep neural networks can effectively memorize user-unique training examples, which can be recovered even after only a single epoch of training. As such, in parameter-transfer methods, the meta-learner or any downstream participant can potentially recover data from a previous task.
However, despite these serious risks, privacy-preserving meta-learning has remained a largely unstudied problem. Our work aims to address this issue by applying differential privacy (DP) [13], a well-established definition of privacy with rich theoretical guarantees and consistent empirical success at preventing leakages of data [8, 16, 18]. Crucially, although there are various threat models and degrees of DP one could consider in the meta-learning setting (as we outline in Section 2), we balance the well-documented trade-off between privacy and model utility by formalizing and focusing on a setting that we call task-global DP. This setting provides a strong privacy guarantee for each task-owner: sharing its model updates with the meta-learner will not reliably reveal anything about specific training examples to any downstream agent. It also allows us to use the framework of Khodak et al. [20] to provide a DP GBML algorithm that enjoys provable learning guarantees in convex settings.
Finally, we show an application of our work by drawing connections to federated learning (FL). While standard methods for FL, such as FedAvg [22], have inspired many works also concerning DP in a multi-user setup [1, 12, 17, 23, 28], we are the first to consider task-global DP as a useful variation on standard DP settings. Moreover, these works fundamentally differ from ours in that they do not consider a task-based notion of learnability, as they aim to learn a single global model (since by design they focus on the global federated learning problem). That being said, a federated setting involving per-user personalization [10, 26] is a natural meta-learning application.
More specifically, our main contributions are:

We are the first to taxonomize the different notions of DP possible for meta-learning; in particular, we formalize a variant we call task-global DP, arguing that it adds a useful option to commonly studied settings in terms of trading off privacy and accuracy.

We propose the first DP GBML algorithm, which we construct to satisfy this privacy setting. Further, we show a straightforward extension for obtaining a group-DP version of our setting that protects multiple samples simultaneously.

We show that our algorithm, along with its theoretical guarantees, naturally carries over to federated learning with personalization. Compared to previous notions of privacy considered in works on DP federated learning [1, 5, 17, 23, 28], we are, to the best of our knowledge, the first to simultaneously provide both privacy and learning guarantees.

Empirically, we demonstrate that our proposed privacy setting allows for strong performance on non-convex federated language-modeling tasks. We achieve close to the performance of non-private models and significantly improve upon the performance of models trained with local-DP guarantees, a previously studied notion that also provides protections against the meta-learner. Our setting reasonably relaxes this latter notion but achieves a multiple of its performance on a modified version of the Shakespeare dataset [7] and on a modified version of Wiki3029 [2].
1.1 Related Work
DP Algorithms in Federated Learning Settings. Works most similar to ours focus on providing DP for federated learning. Specifically, Geyer et al. [17] and McMahan et al. [23] apply update clipping and the Gaussian mechanism to achieve global-DP federated learning algorithms for language-modeling and image-classification tasks, respectively. Their methods are shown to suffer only minor drops in accuracy compared to non-private training, but they do not consider protections against inferences made by the meta-learner. Alternatively, Bhowmick et al. [5] does achieve such protection by applying a theoretically rate-optimal local-DP mechanism on the updates users send to the meta-learner. However, they sidestep hard minimax rates [12] by assuming adversaries have limited side-information and by allowing for a large privacy budget. In this work, though we achieve a relaxation of the privacy of Bhowmick et al. [5], we do not restrict the adversary's power. Finally, Truex et al. [28] does consider a setting that coincides with task-global DP, but they focus primarily on the added benefits of applying MPC (see below) rather than studying the merits of the setting in comparison to other potential settings. Although these approaches all study privacy through the lens of learning a single global model, many of them, as well as our proposed GBML algorithm, are naturally amenable to a federated learning setting with personalization.
Secure Multiparty Computation (MPC). MPC is a cryptographic technique that allows parties to calculate a function of their inputs while maintaining the privacy of each individual party's inputs [6]. In GBML, sets of model updates may come in a batch from multiple tasks, and hence MPC can securely aggregate the batch before it is seen by the meta-learner. Though MPC itself gives no DP guarantees, it prevents the meta-learner from directly accessing any one task's updates and can thus be combined with DP to increase privacy. Analogues of this approach have been studied in the federated setting, e.g. by Agarwal et al. [1], who apply MPC in the same difficult setting as Bhowmick et al. [5], and Truex et al. [28], who apply MPC to a setting analogous to ours. On the other hand, MPC also comes with additional practical challenges such as peer-to-peer communication costs, dropouts, and vulnerability to colluding participants. As such, combined with its applicability to multiple settings, including ours, we consider MPC to be an orthogonal direction.
2 Privacy in a Meta-Learning Context
In this section, we first formalize the meta-learning setting that we consider. We then describe the various threat models that arise in the GBML setup, before presenting the different DP notions that can be achieved. Finally, we highlight the specific model and type of DP that we analyze.
2.1 Parameter-Transfer Meta-Learning
In parameter-transfer meta-learning, we assume that there is a set of learning tasks $t = 1, \dots, T$, each with its corresponding disjoint training set $D_t$. Each $D_t$ contains $m$ training examples. The goal within each task is to learn a function $f_{\theta_t}$ parameterized by $\theta_t$ that performs "well," generally in the sense that it has low within-task population risk in the distributional setting. The meta-learner's goal is to learn an initialization $\phi$ that leads to a well-performing $\theta_t$ within-task. In GBML this is learned via an iterative process that alternates between the following two steps: (1) a within-task procedure in which a batch of task-owners receives the current $\phi$ and each uses it as an initialization for running a within-task optimization procedure, obtaining $\hat{\theta}_t$; (2) a meta-level procedure in which the meta-learner receives these model updates and aggregates them to determine an updated $\phi$.
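The two-step loop above can be sketched as follows. This is a minimal, non-private, Reptile-style illustration on toy quadratic losses; the function names, learning rates, and toy tasks are our own and not from the paper:

```python
import numpy as np

def inner_sgd(phi, data, lr=0.1, steps=5):
    """Step (1): a task-owner starts from the current initialization phi
    and takes a few gradient steps on its own loss, returning theta_hat."""
    theta = phi.copy()
    for _ in range(steps):
        grad = 2 * (theta - data.mean(axis=0))  # toy quadratic loss
        theta -= lr * grad
    return theta

def meta_update(phi, theta_hats, meta_lr=0.5):
    """Step (2): the meta-learner aggregates the batch of task updates and
    moves the shared initialization toward their mean (Reptile-style)."""
    return phi + meta_lr * (np.mean(theta_hats, axis=0) - phi)

rng = np.random.default_rng(0)
phi = np.zeros(2)
task_centers = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]  # two similar tasks
for _ in range(50):
    updates = [inner_sgd(phi, c + 0.01 * rng.standard_normal((20, 2)))
               for c in task_centers]
    phi = meta_update(phi, updates)
# phi converges toward the mean of the task optima, roughly (1, 0)
```

Note that the privacy risk discussed next arises precisely because the raw `updates` are sent to the meta-learner at every round.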
Notably, since each subprocedure only needs to receive the output of the other, an overall GBML algorithm can modularly change either one. From a privacy standpoint, even if the within-task subprocedure is always done locally, specific information about $D_t$ is vulnerable to being inferred by anyone who receives $\hat{\theta}_t$, namely the meta-learner. Similarly, the meta-level procedure can potentially reveal sensitive information about previously seen task-owners' data to future recipients of $\phi$, thus leaving task-owners vulnerable to each other.
2.2 Threat Models for GBML
As in any privacy endeavor, before discussing particular mechanisms, a key specification must be made regarding the threat model being considered. In particular, one must specify both (1) who the potential adversaries are and (2) what information needs to be protected.
Potential adversaries. For a single task-owner, adversaries may be either solely recipients of $\phi$ (i.e. other task-owners) or recipients of either $\phi$ or $\hat{\theta}_t$ (i.e. also the meta-learner). In the latter case, we consider only an honest-but-curious meta-learner, who does not deviate from the agreed-upon algorithm but may try to make inferences based on the information it receives. In both cases, concern is placed not only on the intentions of these other participants, but also on their own security against access by malicious outsiders.
Data to be protected. A system can choose either to protect the information contained in single records one at a time or to protect entire datasets simultaneously. This distinction between record-level and task-level privacy can be practically important. Multiple records within $D_t$ may reveal the same secret (e.g., a cell-phone user has sent their SSN multiple times), or the entire distribution of $D_t$ could reveal sensitive information (e.g., a user has sent all messages in a foreign language). In these cases, record-level privacy may not be sufficient. However, given that privacy and utility are often at odds, we often seek the weakest notion of privacy needed in order to best preserve utility.
In related work, focus has primarily been placed on task-level protections. However, these works usually fall into two extremes: either obtaining strong learning but having to trust the meta-learner [23, 17], or trusting nobody but also obtaining low performance [5]. In response, we try to bridge the gap between these threat models by considering a model that relaxes task-level to record-level privacy but retains protections for each task-owner against all other parties. This relaxation can be reasonably justified in practical situations: while task-level guarantees are strictly stronger, they may also be unnecessary. In particular, record-level guarantees are likely to be sufficient whenever single records each pertain to different individuals. For example, for hospitals, what we care about is providing privacy to the individual patients, not aggregate hospital information. For cell phones, if one can bound by $k$ the number of texts that could contain the same sensitive information, then a straightforward extension of our setting and methods, which protects up to $k$ records simultaneously, could also be sufficient.
2.3 Differential Privacy (DP) in a Single-Task Setting
In terms of actually achieving privacy guarantees for machine learning, a de facto standard has been to apply DP, a provision that strongly limits what one can infer about the examples a given model was trained on. Assuming a training set $D$, two common types of DP are considered.
Differential Privacy (Global DP). A randomized mechanism $M$ is $(\varepsilon, \delta)$-differentially private if for all measurable sets $O$ and for all pairs of datasets $D, D'$ that differ in at most one element: $\Pr[M(D) \in O] \le e^{\varepsilon} \Pr[M(D') \in O] + \delta$. If this holds for $D, D'$ differing in at most $k$ elements, then $(\varepsilon, \delta)$ group DP for groups of size $k$ is achieved.
Local Differential Privacy. A randomized mechanism $M$ is $(\varepsilon, \delta)$-locally differentially private if for any two possible training examples $z, z'$ and any measurable set $O$: $\Pr[M(z) \in O] \le e^{\varepsilon} \Pr[M(z') \in O] + \delta$.
Global DP guarantees that it will be hard to infer the presence of a specific record in the training set by observing the output of $M$. It assumes a trusted aggregator running $M$ that gets to see $D$ directly and then privatizes the final output (usually by adding noise throughout training). Local DP, on the other hand, assumes a stronger threat model in which the aggregator also cannot be trusted. Thus, a random mechanism must be applied individually to each example before the aggregator sees it. Local DP is the stronger guarantee: being locally DP implies being globally DP by invariance to post-processing [13], but it also generally results in worse model performance, since it suffers from provably hard minimax rates (Duchi et al. [12]).
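The utility gap between the two threat models can be illustrated on the simplest possible query, releasing a mean. The sketch below is our own toy example, using a standard sufficient calibration for the Gaussian mechanism; it is not a mechanism from the paper:

```python
import numpy as np

def global_dp_mean(data, eps, delta, clip=1.0, seed=0):
    """Global DP: a trusted aggregator sees the raw data, computes the
    clipped mean, and adds noise once. Replacing one record moves the
    mean by at most 2*clip/n, so that is the sensitivity."""
    rng = np.random.default_rng(seed)
    clipped = np.clip(data, -clip, clip)
    sigma = (2 * clip / len(data)) * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return clipped.mean() + rng.normal(0, sigma)

def local_dp_mean(data, eps, delta, clip=1.0, seed=0):
    """Local DP: each record is noised before the aggregator sees it, so
    noise must be calibrated to the full per-record range of 2*clip."""
    rng = np.random.default_rng(seed)
    clipped = np.clip(data, -clip, clip)
    sigma = 2 * clip * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return (clipped + rng.normal(0, sigma, size=len(data))).mean()

data = np.random.default_rng(1).uniform(-1, 1, size=10_000)
g = global_dp_mean(data, eps=1.0, delta=1e-5)
l = local_dp_mean(data, eps=1.0, delta=1e-5)
# g is typically far closer to data.mean(): its noise scale is O(1/n),
# while the averaged per-record local noise only shrinks as O(1/sqrt(n))
```

This O(1/n) versus O(1/sqrt(n)) gap is the concrete face of the hard minimax rates for local DP mentioned above.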
2.4 Differential Privacy for a GBML Setting
In meta-learning there exists a hierarchy of agents and statistical queries, so we cannot define global and local DP as simply. Here, both the meta-level subprocedure and the within-task subprocedure can be considered individual queries, and a GBML algorithm can implement either to be DP. Further, for each query, the procedure may be altered to satisfy either local DP or global DP. Thus, there are four fundamental options that follow from standard DP definitions.

Global DP: Releasing $\phi$ will at no point compromise information regarding any specific $\hat{\theta}_t$.

Local DP: Additionally, each $\hat{\theta}_t$ is protected from being revealed to the meta-learner.

Task-Global DP: Releasing $\hat{\theta}_t$ will at no point compromise any specific record in $D_t$.

Task-Local DP: Additionally, each record is protected from being revealed to the task-owner.
To form analogies to single-task DP: in the meta-level procedure the examples are the model updates $\hat{\theta}_t$ and the aggregator is the meta-learner, whereas in the within-task procedure the examples are the individual records and the aggregator is the task-owner. As such, (1) is implemented by the meta-learner, (2) and (3) are implemented by the task-owner, and (4) is implemented by record-owners.
By immunity to post-processing, the guarantees for (3) and (4) also automatically apply to the release of any future iteration of $\phi$, thus protecting against future task-owners as well. Meanwhile, though (1) and (2) by definition protect the identities of individual updates $\hat{\theta}_t$, they actually mask the entire presence or absence of any task, thus satisfying a task-level threat model. Intuitively, not being able to infer anything about $\hat{\theta}_t$ implies that nothing can be inferred about the $D_t$ that was used to generate it.
As a consequence, we can directly compare versions of (2) and (3), since both are mechanisms implemented by task-owners. Indeed, as we prove in the Appendix, we have:
Remark 2.1.
If a GBML algorithm achieves local DP at the meta-level, it is also guaranteed to be task-global DP.
The converse, on the other hand, is not generally true: while some task-global-DP mechanisms may result in a local-DP guarantee, the particular privacy parameters will not necessarily carry over. Both notions ensure that each task-owner has a guarantee for releasing $\hat{\theta}_t$, but achieving local DP implies a task-level guarantee at the within-task level, whereas global DP at the within-task level may only provide record-level guarantees.
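The post-processing argument behind Remark 2.1 can be sketched as follows, writing $A$ for the within-task training map and $M$ for the meta-level local-DP mechanism (notation ours):

```latex
% Local DP at the meta-level bounds the likelihood ratio for ANY two inputs:
\Pr[M(x) \in O] \;\le\; e^{\varepsilon}\,\Pr[M(x') \in O] + \delta
  \qquad \forall\, x,\, x',\ \text{measurable } O.
% Instantiating x = A(D_t) and x' = A(D_t') for datasets differing in one record:
\Pr\!\big[M(A(D_t)) \in O\big] \;\le\; e^{\varepsilon}\,\Pr\!\big[M(A(D_t')) \in O\big] + \delta,
% which is exactly the task-global DP condition for releasing
% \hat{\theta}_t = M(A(D_t)).
```

The converse fails because task-global DP only constrains pairs of updates arising from datasets differing in one record, not arbitrary pairs of updates.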
Previous Work | Notion of DP | Privacy for $\phi$ | Privacy for $\hat{\theta}_t$
McMahan et al. [23] | Global | Task-level | —
Geyer et al. [17] | Global | Task-level | —
Bhowmick et al. [5] | Local, Global | Task-level | Task-level
Agarwal et al. [1] | Local + MPC | Task-level | Task-level
Truex et al. [28] | Task-Global + MPC | Record-level | Record-level
Our work | Task-Global | Record-level | Record-level

Table 1: Notions of DP provided by previous work and by our setting.
2.5 Task-Global DP in Comparison to Previous Works
Using the terminology we introduce in Section 2.4, previous works for DP in federated settings can be categorized as in Table 1. While these works do not assume a multi-task setting, the terms global/local and task-global/task-local can still analogously refer to releasing the global model (done by the central server) and user-specific updates (done locally on users' devices), respectively.
Geyer et al. [17] and McMahan et al. [23] both directly provide global DP boosted by subsampling and show that they can achieve performance very close to non-private training. However, this privacy guarantee may be fundamentally insufficient if there is reason to distrust the central server or the security of its computations. Task-global DP, in comparison, provides only record-level protections, but it does protect against these additional potential adversaries.
In contrast, both Bhowmick et al. [5] and Agarwal et al. [1] provide some form of local DP, a strictly stronger setting than task-global DP. However, Bhowmick et al. [5] show that they need to concede a very large privacy budget in order to achieve reasonable performance. Agarwal et al. [1] also consider what is effectively a local-DP guarantee but leverage MPC to reduce the amount of randomization needed per task-owner. As mentioned in Section 1.1, this comes with additional practical challenges and is somewhat of an orthogonal direction. Indeed, Truex et al. [28] consider applying MPC to improve what are inherently task-global DP mechanisms.
Lastly, we remark that task-local DP has not previously been studied, as protecting individual data points from their own task-owner is unlikely to be a concern (e.g., cell-phone users already own their text messages, and one would assume patients already trust their hospitals).
Overall, in contrast to past works, we are the first to formalize and consider the advantages and trade-offs implicit in privatizing the within-task algorithm. Additionally, we show that task-global DP is the first notion of DP for which any form of provable meta-learning guarantees can be given (Section 3), and also that it empirically improves upon local DP in terms of utility (Section 4).
3 Differentially Private Parameter-Transfer
3.1 Algorithm
We now present our DP GBML method, which is written out in its online (regret) form in Algorithm 1. Here, both within-task optimization and meta-optimization are done using some form of gradient descent. The key difference between this algorithm and traditional GBML is that, since task-learners must send back privatized model updates, each now applies an $(\varepsilon, \delta)$-DP gradient descent procedure to learn $\hat{\theta}_t$ when called. However, at meta-test time the task-learner runs a non-private descent algorithm to obtain the parameter used for inference, as this parameter does not need to be sent to the meta-learner. To obtain learning-theoretic guarantees, we use a variant of Algorithm 1 in which the DP algorithm is an SGD procedure [3, Algorithm 1] that adds a properly scaled Gaussian noise vector at each iteration. A stability result due to Bassily et al. [3] regarding the population loss of this algorithm's output allows us to provide bounds on the transfer risk of our meta-algorithm.

3.2 Privacy Guarantees
We run a certified DP version of SGD [3, Algorithm 1] within each task. This guarantees that the contribution of each task-owner, a model update $\hat{\theta}_t$ trained on their data, carries global DP guarantees with respect to the meta-learner. Additionally, since DP is preserved under post-processing, the release of any future calculation stemming from $\hat{\theta}_t$ also carries the same DP guarantee.
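The clip-and-noise pattern behind such a within-task mechanism can be sketched as follows. This is not the exact procedure of Bassily et al. [3]; it is a simplified full-batch variant with naive composition-based accounting, and all names, losses, and constants are our own illustration:

```python
import numpy as np

def dp_sgd(phi, examples, grad_fn, eps=1.0, delta=1e-5, clip=1.0,
           lr=0.1, epochs=50, seed=0):
    """Within-task noisy (full-batch) gradient descent: clip every
    per-example gradient to norm `clip`, average, add Gaussian noise
    calibrated to the average's sensitivity (2*clip/n) composed over
    `epochs` noisy releases, and return the privatized theta_hat."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(phi, dtype=float).copy()
    n = len(examples)
    sigma = (2 * clip / n) * np.sqrt(2 * epochs * np.log(1.25 / delta)) / eps
    for _ in range(epochs):
        grads = []
        for x in examples:
            g = grad_fn(theta, x)
            grads.append(g / max(1.0, np.linalg.norm(g) / clip))  # per-example clip
        theta -= lr * (np.mean(grads, axis=0) + rng.normal(0, sigma, theta.shape))
    return theta

# toy task: squared loss around points near (1, 1)
examples = np.random.default_rng(1).normal([1.0, 1.0], 0.1, size=(200, 2))
grad_fn = lambda theta, x: 2 * (theta - x)
theta_hat = dp_sgd(np.zeros(2), examples, grad_fn)
# theta_hat lands near the task optimum (1, 1) despite the added noise
```

The key point mirrored from the paper is that only the noised `theta_hat` ever leaves the task-owner, so post-processing by the meta-learner inherits the same guarantee.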
3.3 Learning Guarantees
Our learning result follows the setup of Baxter [4], who formalized the LTL (learning-to-learn) problem as using task-distribution samples from some meta-distribution $Q$, together with samples from those tasks, to improve performance when a new task is sampled from $Q$ and we draw samples from it. In the setting of parameter-transfer meta-learning we are learning functions parameterized by real-valued vectors $\theta$, so our goal follows that of Denevi et al. [11] and Khodak et al. [21] in seeking bounds on the transfer risk (the distributional performance of a learned parameter on a new task from $Q$) that improve with task similarity.
The specific task-similarity metric we consider is how close together the risk-minimizing parameters of tasks sampled from the distribution are. This is measured in terms of a quantity $V^2$, the average squared deviation of the risk minimizers $\theta^*_P$ of task-distributions $P$ sampled from $Q$. This quantity is roughly the variance of risk-minimizing task parameters and is a standard quantifier of the improvement due to meta-learning [11, 21]. For example, Denevi et al. [11] show excess transfer-risk guarantees that decay with the number of tasks $T$ when tasks with $m$ samples each are drawn from the distribution. Such a guarantee ensures that, as we see more tasks, our transfer risk becomes roughly proportional to $V$, which, if the tasks are similar, i.e. $V$ is small, implies that LTL improves over single-task learning.

In Algorithm 1, each user obtains a within-task parameter by running (non-private) OGD on a sequence of losses and averaging the iterates. The regret of this procedure, when averaged across the users, implies a bound on the expected excess transfer risk of a new task from $Q$ when running OGD from a learned initialization [9]. Thus our goal is to bound this regret in terms of $V$; here we follow the Average Regret-Upper-Bound Analysis (ARUBA) framework of Khodak et al. [21] and treat the meta-update procedure itself as an online algorithm optimizing a bound on the performance measure (regret) of each within-task algorithm. As OGD's regret depends on the squared distance of the optimal parameter $\theta^*_t$ from the initialization $\phi$, with no privacy concerns one could simply update $\phi$ using $\theta^*_t$ to recover guarantees similar to those in Denevi et al. [11] and Khodak et al. [21].
However, this approach requires sending $\theta^*_t$ to the meta-learner, which is not private; instead, in Algorithm 1 we send $\hat{\theta}_t$, the output of noisy SGD. To apply ARUBA, we need an additional assumption, namely that the task losses $\ell_t$ satisfy the following quadratic growth (QG) property: for some $\mu > 0$,

$\ell_t(\theta) - \ell_t(\theta^*_t) \ge \frac{\mu}{2} \|\theta - \theta^*_t\|^2.$   (1)

Here $\theta^*_t$ is the risk minimizer of $\ell_t$. This assumption, which Khodak et al. [20] show is reasonable in settings such as logistic regression, amounts to a statistical non-degeneracy assumption on the parameter space: parameters far away from the risk minimizer do not have low risk. Note that QG is significantly weaker than strong convexity, which previous work [15] has assumed to hold for task losses but which does not hold for applicable cases such as few-shot least-squares or logistic regression when the number of task samples is smaller than the data dimension.

We are now able to state our main theoretical result, a proof of which is given in Appendix B. The result follows from a bound on the task-averaged regret (TAR) across all tasks of a simple online meta-learning procedure that treats the update $\hat{\theta}_t$ sent by each task as an approximation of the optimal parameter in hindsight $\theta^*_t$. Since this parameter determines regret on that task, by reducing the meta-update procedure to OCO on this sequence of functions in a manner similar to [20], we are able to show a task-similarity-dependent bound on the TAR. The statistical guarantee then stems from a nested online-to-batch conversion, a standard procedure for converting low-regret online-learning algorithms into low-risk distribution-learning algorithms.
Theorem 3.1.
Suppose $Q$ is a distribution over task-distributions over Lipschitz, Lipschitz-smooth, 1-bounded convex loss functions over a parameter space of bounded diameter, and let each task satisfy the quadratic growth property (1). Suppose the distribution of each task is sampled i.i.d. from $Q$ and we run Algorithm 1 with the DP procedure of Bassily et al. [3, Algorithm 1] to obtain $\hat{\theta}_t$ as the average iterate for the meta-update step. Then, for an appropriate choice of algorithm parameters, we obtain a bound on the expected transfer risk when a new task is sampled from $Q$, $m$ samples are drawn i.i.d. from it, and we run OGD starting from the learned initialization $\phi$, using the average of the resulting iterates as the learned parameter; the dominant per-task term scales with the task similarity $V$, plus lower-order terms in the number of tasks and the privacy parameters. Here the outer expectation is taken over the sampling of tasks and the randomness of the within-task DP mechanism. Note that this procedure is $(\varepsilon, \delta)$-DP.
Theorem 3.1 shows that one can usefully run a DP algorithm as the within-task method in meta-learning and still obtain improvement due to task similarity. Specifically, the standard term in the number of within-task samples is multiplied by the task-similarity quantity $V$, which is small if the tasks are related via the closeness of their risk minimizers. Thus we can use meta-learning to improve within-task performance relative to single-task learning. We also obtain a very fast rate of convergence in the number of tasks. However, we do incur some additional terms due to the quadratic-growth approximation and the privacy mechanism. Note that the assumption that both the functions and their gradients are Lipschitz-continuous is standard and required by the noisy SGD procedure of Bassily et al. [3].
This theorem also admits a relatively straightforward extension if the desire is to provide group DP. Since any privacy mechanism that provides $(\varepsilon, \delta)$-DP also provides DP guarantees for groups of size $k$ [13], we immediately have the following corollary.
Corollary 3.1.
Under the same assumptions and setting as Theorem 3.1, achieving group DP for groups of size $k$ is possible with the same guarantee, except with $\varepsilon$ replaced by $\varepsilon/k$.
For constant $k$, this allows us to enjoy the stronger guarantee while maintaining largely the same learning rates. This is a useful result given that in some settings it may be desirable to simultaneously protect small groups of size $k$, such as entire families in hospital records.
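The cost of Corollary 3.1 can be seen directly in how Gaussian-mechanism noise scales with the group size. This is a hedged sketch using one standard sufficient calibration; the exact mechanism in Algorithm 1 may be calibrated differently:

```python
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """A common sufficient noise calibration for the Gaussian mechanism."""
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

# Record-level guarantee at budget eps:
sigma_record = gaussian_sigma(sensitivity=1.0, eps=1.0, delta=1e-5)

# Group guarantee for groups of size k, per Corollary 3.1: rerun the same
# machinery with eps/k, i.e. exactly k-times more noise.
k = 4
sigma_group = gaussian_sigma(sensitivity=1.0, eps=1.0 / k, delta=1e-5)
# sigma_group == k * sigma_record
```

For constant $k$ (e.g. protecting a family of 4 in hospital records), this constant-factor noise increase is the entire price of the stronger guarantee.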
4 Empirical Results
In this section, we present results showing that it is possible to learn useful deep models in federated scenarios while still preserving task-global privacy. In particular, our focus is to evaluate the performance of models that have been optimized with a task-global DP algorithm against both non-privately trained models and models trained with the previously more commonly studied local DP. To this end, we evaluate the performance of an LSTM RNN on language-modeling tasks and apply a practical variant of Algorithm 1 that processes both tasks and within-task examples in batches instead of serially. To obtain within-task privacy, we alter the within-task algorithm to be the DP-SGD algorithm as implemented by TensorFlow Privacy (https://github.com/tensorflow/privacy), and to obtain local privacy we use a modification of [23] in which each task separately applies a Gaussian mechanism on a single update before sending it to the meta-learner.
Datasets:
We train a next-word predictor on two federated datasets: (1) the Shakespeare dataset as preprocessed by [7], and (2) a dataset constructed from Wikipedia articles drawn from the Wiki3029 dataset [2], where each article is used as a different task. For each dataset, we fix a number of tokens per task, discard tasks with fewer tokens than specified, and discard samples from tasks with more. We set the number of tokens per task to 800 for Shakespeare and to 1600 for Wikipedia and divide tokens into sequences of fixed length; we refer to these modified datasets as Shakespeare-800 and Wiki-1600.
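The preprocessing just described can be sketched as follows (our own illustrative helper; only the per-task token counts, matching the dataset names, come from the text):

```python
def make_task_sequences(tokens, tokens_per_task, seq_len):
    """Preprocess one task's token stream: discard the task if it has too
    few tokens, truncate it if it has more, then split the remainder into
    fixed-length sequences. For Shakespeare-800, tokens_per_task is 800."""
    if len(tokens) < tokens_per_task:
        return None  # task discarded
    tokens = tokens[:tokens_per_task]
    return [tokens[i:i + seq_len] for i in range(0, tokens_per_task, seq_len)]

seqs = make_task_sequences(list(range(20)), tokens_per_task=12, seq_len=4)
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Fixing the token count per task keeps every task's privacy accounting identical, which matters for the per-task guarantees discussed below.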
Meta-Learning Algorithm.
We study the performance of our method when applied to the batch version of Reptile [24] (which, in our setup, reduces to personalized Federated Averaging when the meta-learning rate is set to 1). We tune various configurations of task batch size for all methods, and for the non-private baseline we also tune for multiple visits per client, since there is no privacy degradation to account for. Additionally, we implement an exponential decay on the meta-learning rate. We defer a full discussion of hyperparameter tuning to the Appendix.
Privacy Considerations.
For the task-global DP models, we set the same per-task privacy parameters for each task in both Shakespeare-800 and Wiki-1600, and we implement the mechanism with the tools provided by TensorFlow Privacy. Although their mechanism differs from the one presented in Section 3, it still lets us explore task-global privacy in a realistic setting. We use the RDP accountant provided in order to keep track of our privacy budget, obtaining the same final guarantee for both Shakespeare and Wikipedia. Finally, for both datasets, we make sure that all tasks and their samples are seen only once, as we cannot leverage any subsampling results if the meta-learner can directly see who is sending updates each round.
For the local-DP training, even though this notion of DP is stronger, we explore the same privacy budgets so as to obtain guarantees of the same confidence. Here, we run the DP-FedAvg algorithm from [23] with two key changes. First, we add Gaussian noise to each clipped set of model updates before returning them to the central server, instead of after aggregation. Second, we iterate through tasks without replacement with a fixed batch size, rather than sampling each task with independent probability in each round. The first change gives us local DP; the second is necessary since multiple visits to a single client result in significant degradation of the privacy guarantee, and we want each client to end up with the same final privacy parameters.
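The two modifications to DP-FedAvg can be sketched as follows (our own simplified rendering; clipping thresholds and noise scales are placeholder values, not the paper's):

```python
import numpy as np

def client_update_local_dp(update, clip, sigma, rng):
    """First change: the client clips its own model delta and adds Gaussian
    noise BEFORE sending, so the server never sees a raw update."""
    clipped = update / max(1.0, np.linalg.norm(update) / clip)
    return clipped + rng.normal(0, sigma, size=update.shape)

def server_round(client_updates, clip, sigma, rng):
    """The server only averages already-privatized updates. (The second
    change, visiting clients in fixed batches without replacement, is
    handled by whoever schedules the rounds.)"""
    return np.mean([client_update_local_dp(u, clip, sigma, rng)
                    for u in client_updates], axis=0)

rng = np.random.default_rng(0)
updates = [np.full(3, float(i)) for i in range(1, 5)]
avg = server_round(updates, clip=10.0, sigma=0.0, rng=rng)  # sigma=0 sanity check
# with sigma=0 and no clipping active, this reduces to plain FedAvg: mean 2.5
```

Because each client noises its own update, the per-client noise scale cannot be amortized over the batch, which is why local DP pays such a large utility cost relative to the task-global setting.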
Results.
Figure 2 shows the performance of both the non-private and task-global private versions of Reptile [24] on the Shakespeare and Wikipedia datasets. As expected, in neither case does the private (noised) version reach the same accuracy as the non-private (noiseless) version of the algorithm. Nonetheless, the private version still comes close to the non-private accuracy on both Shakespeare-800 and Wiki-1600. Meanwhile, achieving local DP at the meta-level yields only a fraction of the non-private accuracy on both Shakespeare-800 and Wiki-1600.
In practice, these differences could be adjusted by changing the privacy budget of the algorithm or, for a given privacy budget, by trading off more training iterations against larger noise multipliers.
Figure 2: Performance of different versions of Reptile on a next-word-prediction task for two federated datasets. We report the test accuracy on unseen tasks and repeat each experiment 10 times. Solid lines correspond to means, colored bands indicate one standard deviation, and dotted lines are for comparing final accuracies (privatized versions of the algorithm are trained with only one visit per client).
5 Conclusions
In this work, we have outlined and studied the issue of privacy in the context of meta-learning. Focusing on the class of gradient-based parameter-transfer methods, we used differential privacy to address the privacy risks posed to task-owners by sharing task-specific models with a central meta-learner. To do so, we formalized and considered the notion of task-global differential privacy, which guarantees that individual examples from the tasks are protected from all downstream agents (particularly the meta-learner). Working in this privacy model, we developed a differentially private algorithm that provides both this strong protection and learning-theoretic guarantees in the convex setting. Finally, we demonstrated how this notion of privacy can translate into useful deep learning models for federated tasks.
Acknowledgments
This work was supported in part by DARPA FA8750-17-C-0141, National Science Foundation grants IIS-1705121 and IIS-1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, a JP Morgan A.I. Research Faculty Award, and a Carnegie Bosch Institute Research Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.
References
Agarwal et al. [2018] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems 31, pages 7564–7575. Curran Associates, Inc., 2018.
 Arora et al. [2019] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Nikunj Saunshi, and Orestis Plevrakis. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
Bassily et al. [2019] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates, 2019. URL https://arxiv.org/abs/1908.09970.

Baxter [2000] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
Bhowmick et al. [2019] Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection against reconstruction and its applications in private federated learning, 2019. URL https://arxiv.org/abs/1812.00984.
 Bonawitz et al. [2017] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy preserving machine learning. Cryptology ePrint Archive, Report 2017/281, 2017. https://eprint.iacr.org/2017/281.
Caldas et al. [2018] Sebastian Caldas, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings, 2018. URL http://arxiv.org/abs/1812.01097.
 Carlini et al. [2018] Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets, 2018. URL http://arxiv.org/abs/1802.08232.
Cesa-Bianchi et al. [2004] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of online learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
Chen et al. [2018] Fei Chen, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. Federated meta-learning for recommendation. CoRR, abs/1802.07876, 2018. URL http://arxiv.org/abs/1802.07876.
Denevi et al. [2019] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization, 2019. URL http://arxiv.org/abs/1903.10399.

Duchi et al. [2018] John Duchi, Martin Wainwright, and Michael Jordan. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 2018.
Dwork and Roth [2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014. doi: 10.1561/0400000042.
Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Finn et al. [2019] Chelsea Finn, Aravind Rajeswaran, Sham M. Kakade, and Sergey Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
 Fredrikson et al. [2015] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333, 2015.
 Geyer et al. [2018] Robin C. Geyer, Tassilo J. Klein, and Moin Nabi. Differentially private federated learning: A client level perspective, 2018. URL https://openreview.net/forum?id=SkVRTj0cYQ.
 Jayaraman and Evans [2019] Bargav Jayaraman and David Evans. When relaxations go bad: "differentiallyprivate" machine learning, 2019. URL http://arxiv.org/abs/1902.08874.
Karimi et al. [2016] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.
Khodak et al. [2019a] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019a.
Khodak et al. [2019b] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, 2019b. To appear.
McMahan et al. [2017] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1273–1282, 2017.
 McMahan et al. [2018] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private language models. In ICLR, 2018.
Nichol et al. [2018] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018. URL http://arxiv.org/abs/1803.02999.
 Shokri et al. [2017] Reza Shokri, Marco Stronati, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Proceedings of 2017 IEEE Symposium on Security and Privacy, pages 3–18, 2017.
Smith et al. [2017] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems 30, 2017.
Stolfo et al. [1997] Salvatore J. Stolfo, David W. Fan, Wenke Lee, Andreas L. Prodromidis, and Philip K. Chan. Credit card fraud detection using meta-learning: Issues and initial results. In Working Notes of the AAAI Workshop on AI Approaches to Fraud Detection and Risk Management, 1997.
Truex et al. [2019] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, and Rui Zhang. A hybrid approach to privacy-preserving federated learning, 2019. URL http://arxiv.org/abs/1812.03224.
Zhang et al. [2019] Xi Sheryl Zhang, Fengyi Tang, Hiroko Dodge, Jiayu Zhou, and Fei Wang. MetaPred: Meta-learning for clinical risk prediction with limited patient electronic health records, 2019. URL https://arxiv.org/abs/1905.03218.
Appendix A Local Meta-Level DP and Task-Global DP
Remark A.1.
If a GBML algorithm achieves local DP at the meta-level, it is also guaranteed to be DP at a task-global level.
Proof.
According to the definition of local DP, a mechanism $\mathcal{M}$ that achieves $\varepsilon$-local DP for releasing model updates must satisfy, for any inputs $x, x'$ and any set of outputs $O$:
$$\Pr[\mathcal{M}(x) \in O] \le e^{\varepsilon} \Pr[\mathcal{M}(x') \in O].$$
Here $x$ can also be seen as a function, possibly stochastic, of the task dataset $S$, or more formally $x = f(S, \phi)$, where $\phi$ is an initialization and $f$ is the within-task training procedure. Thus, by also setting $x' = f(S', \phi)$ for any other dataset $S'$, we automatically get for any set of outputs $O$:
$$\Pr[\mathcal{M}(f(S, \phi)) \in O] \le e^{\varepsilon} \Pr[\mathcal{M}(f(S', \phi)) \in O].$$
This holds by definition when $f$ is deterministic, since $f(S, \phi)$ and $f(S', \phi)$ are single elements of the input space of $\mathcal{M}$. When $f(S, \phi)$ and $f(S', \phi)$ are stochastic, the bound also holds, since it holds even in the worst case for any single pair of elements in the input space. Further, the bound holds no matter how many elements differ between $S$ and $S'$, as long as $f$ outputs something in the input space of $\mathcal{M}$. Thus, if we treat $\mathcal{M} \circ f$ as one mechanism, we get the given proposition.
∎
Appendix B Proofs of Learning Guarantees
Throughout this section we assume all subsets are convex and lie in Euclidean space unless explicitly stated otherwise. In the online learning setting we use shorthand notation for the subgradient of a loss function evaluated at a given action, and for any sequence of quantities we use the corresponding shorthand for the sum of its first several elements.
In this section we first prove (Theorem B.1) a general averaged-regret bound following the ARUBA framework of Khodak et al. [21]. We then combine an algorithmic-stability-based DP generalization bound for noisy SGD of Bassily et al. [3] with a quadratic-growth assumption [19, 20] to show that such an algorithm returns a meta-update parameter that is sufficiently close to the optimum, which suffices to show a meaningful task-averaged-regret guarantee (Corollary B.1). We conclude by using this bound to derive a guarantee in the statistical LTL setting (Corollary B.2).
Setting B.1.
We assume all loss functions are convex and Lipschitz for some constant, and that the action space has bounded diameter. We define the following quantities:
- convenience coefficients;
- the sequence of update parameters, together with its mean;
- a sequence of reference parameters, together with its mean;
- a sequence of optimal parameters in hindsight;
- a positive task-similarity parameter;
- a learning rate, set in terms of the above constants.
Theorem B.1.
In Setting B.1, define the regret upper bound and the averaged-regret upper bound. Then, in Algorithm 2, if the meta-learner uses FTL or AOGD to pick the meta-initialization and the within-task descent algorithm has its regret upper-bounded accordingly, we have the following bound:
Here the expectation is taken over the randomness of the DP mechanism.
Proof.
Setting B.2.
In Setting B.1, assume the loss functions are generated by first picking some distribution over valid losses and then sampling a number of them i.i.d. Assume further that the expected loss $\bar\ell$ of every such distribution satisfies quadratic growth (QG): for some $\mu > 0$, any parameter $\theta$, and the minimizer $\theta^*$ of $\bar\ell$ closest to $\theta$, we have
$$\bar\ell(\theta) - \bar\ell(\theta^*) \ge \frac{\mu}{2} \lVert \theta - \theta^* \rVert_2^2.$$
Furthermore, assume that these losses are $\beta$-strongly-smooth: for all $\theta, \theta'$,
$$\ell(\theta') \le \ell(\theta) + \langle \nabla \ell(\theta), \theta' - \theta \rangle + \frac{\beta}{2} \lVert \theta' - \theta \rVert_2^2.$$
Finally, assume that the minimizer $\theta^*$ is unique for every such distribution.
Lemma B.1.
Let a sequence of convex losses be drawn i.i.d. from some distribution whose risk is QG, and let any of the optimal actions in hindsight be given. Then the minimum of the risk closest to that action satisfies
Proof.
Lemma B.2.
Let a sequence of strongly-smooth, Lipschitz convex losses be drawn i.i.d. from some distribution whose risk is QG, and let the average iterate of running Algorithm 1 of Bassily et al. [3] with the appropriate parameters for obtaining DP be given. If the sample size is sufficiently large, then the minimum of the risk closest to this iterate satisfies
Proof.
The result follows by directly substituting the bound of Theorem 3.2 of Bassily et al. [3] into the definition of QG.
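One way to sketch this substitution, writing $\hat\theta$ for the average iterate, $\theta^*$ for the closest risk minimizer, $\mu$ for the QG constant, and abbreviating the excess-risk bound of Theorem 3.2 of [3] as $\mathrm{err}_{\mathrm{DP}}$ (these symbols are ours and need not match the paper's notation):

```latex
\frac{\mu}{2}\,\lVert \hat\theta - \theta^* \rVert_2^2
  \;\le\; \bar\ell(\hat\theta) - \bar\ell(\theta^*)
  \;\le\; \mathrm{err}_{\mathrm{DP}},
\qquad\text{so}\qquad
\lVert \hat\theta - \theta^* \rVert_2
  \;\le\; \sqrt{2\,\mathrm{err}_{\mathrm{DP}} / \mu}.
```

The first inequality is exactly the QG condition, and the second is the excess-risk guarantee of the DP optimizer.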
∎
Proposition B.1.
In Setting B.2 we have the following two bounds:
Proof.
Corollary B.1.
Corollary B.2.
In Setting B.2 and under the assumptions of Corollary B.1, if the distribution of each task is sampled i.i.d. from some environment, then we have the following bound on the expected transfer risk when a new task is sampled from the environment, samples are drawn i.i.d. from it, and we run OGD starting from the learned initialization, using the average of the resulting iterates as the learned parameter:
Here the comparator is any element of the action space, and the outer expectation is taken over the task sampling and the randomness of the DP mechanism.
Appendix C Experiment Details
Datasets:
We train a next-word predictor for two federated datasets: (1) the Shakespeare dataset as preprocessed by [7], and (2) a dataset constructed from Wikipedia articles, where each article is used as a different task. For each dataset, we fix a number of tokens per task, discard tasks with fewer tokens than specified, and discard the surplus samples from tasks with more. For Shakespeare, we set the number of tokens per task to tokens, leaving tasks for meta-training, for meta-validation, and for meta-testing. For Wikipedia, we set the number of tokens to , which corresponds to having tasks for meta-training, for meta-validation, and for meta-testing. For the meta-validation and meta-test tasks, of the tokens are used for local training and the remaining for local testing.
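The per-task filtering and truncation described above can be sketched as follows; `build_tasks` and its interface are our own illustrative assumptions, not the preprocessing code of [7].

```python
def build_tasks(raw_tasks, tokens_per_task):
    """Keep only tasks with at least `tokens_per_task` tokens, truncating
    longer ones so every surviving task has exactly the same size."""
    tasks = []
    for tokens in raw_tasks:
        if len(tokens) < tokens_per_task:
            continue  # discard tasks that are too small
        tasks.append(tokens[:tokens_per_task])  # drop the surplus tokens
    return tasks
```

Fixing the per-task size this way keeps the privacy and learning guarantees uniform across tasks, since every task contributes the same number of examples.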
Model Structure:
Our model first maps each token to an embedding of dimension before passing it through an LSTM of two layers of units each. The LSTM emits an output embedding, which is scored against all items of the vocabulary via a dot product followed by a softmax. We build the vocabulary from the tokens in the meta-training set and fix its length to . We use a sequence length of for the LSTM and, as in [23], we evaluate using AccuracyTop1 (i.e., we only consider the predicted word to which the model assigned the highest probability) and count all predictions of the unknown token as incorrect.
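The evaluation metric can be made concrete with a short sketch; the function below is our own hypothetical implementation of AccuracyTop1 with unknown-token predictions counted as incorrect, and its name and interface are assumptions.

```python
def accuracy_top1(scores, targets, unk_id):
    """AccuracyTop1: only the single highest-scoring word counts as the
    prediction, and predicting the unknown token is always wrong."""
    correct = 0
    for row, target in zip(scores, targets):
        pred = max(range(len(row)), key=row.__getitem__)  # argmax over vocab
        if pred == target and pred != unk_id:
            correct += 1
    return correct / len(targets)
```

Counting unknown-token predictions as errors prevents a model from inflating its score by defaulting to the catch-all token on rare words.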
Hyperparameters:
We tune the hyperparameters on the set of meta-validation tasks. For all datasets and all versions of the meta-learning algorithm, we tune hyperparameters in a two-step process. We first tune all the parameters that are not related to refinement: the meta learning rate, the local (within-task) learning rate used during meta-training, the maximum gradient norm, and the decay constant. We then take the configuration with the best pre-refinement accuracy and tune the refinement parameters: the refinement learning rate, the refinement batch size, and the number of refinement epochs.
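The two-stage tuning procedure described above can be sketched as a generic grid search; `two_stage_tune` and the `evaluate` callback are our own illustrative assumptions, not the paper's tuning code.

```python
import itertools

def two_stage_tune(evaluate, base_grid, refine_grid):
    """Two-stage search: first tune the non-refinement hyperparameters by
    pre-refinement validation accuracy, then freeze the winner and tune the
    refinement hyperparameters on top of it. `evaluate(config)` is an
    assumed callback returning validation accuracy for a configuration."""
    def grid(space):
        keys = list(space)
        for vals in itertools.product(*(space[k] for k in keys)):
            yield dict(zip(keys, vals))

    # Stage 1: best configuration before any refinement parameters are set.
    best_base = max(grid(base_grid), key=evaluate)
    # Stage 2: tune refinement parameters with the base configuration fixed.
    return max(({**best_base, **r} for r in grid(refine_grid)), key=evaluate)
```

Splitting the search this way trades a possibly sub-optimal joint configuration for a much smaller number of evaluations than a full joint grid.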
All other hyperparameters are kept fixed for the sake of comparison: full-batch steps were taken on within-task data, with the maximum number of microbatches used for the task-global DP model. The parameter search spaces are given in Tables 2, 3, and 4; in these tables, the final hyperparameters we used are in bold.
Hyperparameter  Shakespeare  Wiki

Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Clipping
Refine LR
Refine Minibatch Size
Refine Epochs
Hyperparameter  Shakespeare  Wiki

Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Clipping
Refine LR
Refine Minibatch Size
Refine Epochs
Hyperparameter  Shakespeare  Wiki

Visits Per Task
Tasks Per Round
Within-Task Epochs
Meta LR
Meta Decay Rate
Within-Task LR
Clipping
Refine LR
Refine Minibatch Size
Refine Epochs