The goal of meta-learning can be broadly defined as using the data of existing tasks to learn algorithms or representations that enable better or faster performance on unseen tasks. As the modern iteration of learning-to-learn (LTL) Thrun & Pratt (1998), research on meta-learning has largely focused on developing new tools that can exploit the power of the latest neural architectures. Examples include the control of stochastic gradient descent (SGD) itself using a recurrent neural network Ravi & Larochelle (2017) and learning deep embeddings that allow simple classification methods to work well Snell et al. (2017). A particularly simple but successful approach has been parameter-transfer via gradient-based meta-learning, which learns a meta-initialization $\phi$ for a class of parametrized functions such that one or a few stochastic gradient steps on a few samples from a new task suffice to learn good task-specific model parameters. For example, when presented with examples for an unseen task, the popular MAML algorithm Finn et al. (2017) outputs
$$\hat\theta = \phi - \alpha \nabla \ell(\phi) \qquad (1)$$
for loss function $\ell$ and learning rate $\alpha$; $\hat\theta$ is then used for inference on the task. Despite its simplicity, gradient-based meta-learning is a leading approach for LTL in numerous domains including vision Li et al. (2017); Nichol et al. (2018); Kim et al. (2018), robotics Al-Shedivat et al. (2018), and federated learning Chen et al. (2018).
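As an illustration, the one-step update in (1) is just a gradient step taken from the meta-initialization. Below is a minimal sketch on a hypothetical quadratic task; the function names and the toy loss are our own, not from the paper:

```python
import numpy as np

def inner_update(phi, grad_fn, lr):
    """One gradient step from the meta-initialization phi, in the style of Eq. (1)."""
    return phi - lr * grad_fn(phi)

# Hypothetical task: loss(theta) = 0.5 * ||theta - target||^2, so grad(theta) = theta - target.
target = np.array([1.0, -2.0])
grad_fn = lambda theta: theta - target

phi = np.zeros(2)                       # meta-initialization
theta_hat = inner_update(phi, grad_fn, lr=0.5)
```

With a larger learning rate or more steps, `theta_hat` moves closer to the task optimum; MAML additionally learns `phi` by differentiating through this update.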
While meta-initialization is a more recent approach, methods for parameter-transfer have long been studied in the multi-task, transfer, and lifelong learning communities Evgeniou & Pontil (2004); Kuzborskij & Orabona (2013); Pentina & Lampert (2014). A common classical alternative to (1), which in modern parlance may be called meta-regularization, is to learn a good bias $\phi$ for the following regularized empirical risk minimization (ERM) problem:
$$\hat\theta = \arg\min_\theta \sum_s \ell_s(\theta) + \frac{\lambda}{2}\|\theta - \phi\|_2^2 \qquad (2)$$
Although there exist statistical guarantees and poly-time algorithms for learning a meta-regularization for simple models Pentina & Lampert (2014); Denevi et al. (2018b), such methods are impractical and do not scale to modern settings with deep neural architectures and many tasks. On the other hand, while the theoretically less-studied meta-initialization approach is often compared to meta-regularization Finn et al. (2017), their connection is not rigorously understood.
In this work, we formalize this connection using the theory of online convex optimization (OCO) Zinkevich (2003), in which an intimate connection between initialization and regularization is well understood due to the equivalence of online gradient descent (OGD) and follow-the-regularized-leader (FTRL) Shalev-Shwartz (2011); Hazan (2015). In the lifelong setting of an agent solving a sequence of OCO tasks, we use this connection to analyze an algorithm that learns a meta-parameter $\phi$, which can be a meta-initialization for OGD or a meta-regularization for FTRL, such that the within-task regret of these algorithms improves with the similarity of the online tasks; here the similarity is measured by the distance between the optimal actions of each task and is not known beforehand. This algorithm, which we call Follow-the-Meta-Regularized-Leader (FMRL, or Ephemeral), scales well in both computation and memory requirements, and in fact generalizes the gradient-based meta-learning algorithm Reptile Nichol et al. (2018), thus providing a convex-case theoretical justification for a leading method in practice.
More specifically, we make the following contributions:
Our first result assumes a sequence of OCO tasks whose optimal actions are inside a small subset $\Theta^*$ of the set $\Theta$ of all possible actions. We show how Ephemeral can use these tasks to make the average regret decrease in the diameter of $\Theta^*$ while doing no worse on dissimilar tasks. Furthermore, we extend a lower bound of Abernethy et al. (2008) to the multi-task setting to show that Ephemeral is provably better than single-task learning and that one can do no more than a small constant factor better sans stronger assumptions.
Under a realistic assumption on the loss functions, we show that Ephemeral also has low-regret guarantees in the practical setting where the optimal actions are difficult or impossible to compute and the algorithm only has access to a statistical or numerical approximation. In particular, we show high-probability regret bounds in the case when the approximation uses the gradients observed during within-task training, as is done in practice by Reptile Nichol et al. (2018).
We verify several assumptions and implications of our theory using a new meta-learning dataset we introduce consisting of text-classification tasks solvable using convex methods. We further study the empirical suggestions of our theory in the deep learning setting.
1.1 Related Work
Gradient-Based Meta-Learning: The model-agnostic meta-learning (MAML) algorithm of Finn et al. (2017) pioneered this recent approach to LTL. A great deal of empirical work has studied and extended this approach Li et al. (2017); Grant et al. (2018); Nichol et al. (2018); Jerfel et al. (2018); in particular, Nichol et al. (2018) develop Reptile, a simple yet equally effective first-order simplification of MAML for which our analysis shows provable guarantees as a subcase. Theoretically, Franceschi et al. (2018) provide computational convergence guarantees for gradient-based meta-learning for strongly-convex functions, while Finn & Levine (2018) show that with infinite data MAML can approximate any function of task samples assuming a specific neural architecture as the model. In contrast to both results, we show finite-sample learning-theoretic guarantees for convex functions under a natural task-similarity assumption.
Online LTL: Learning-to-learn and multi-task learning (MTL) have both been extensively studied in the online setting, although our setting differs significantly from the one usually studied in online MTL Abernethy et al. (2007); Dekel et al. (2007); Cavallanti et al. (2010). There, in each round an agent is told which of a fixed set of tasks the current loss belongs to, whereas our analysis is in the lifelong setting, in which tasks arrive one at a time. Here there are many theoretical results for learning useful data representations Ruvolo & Eaton (2013); Pentina & Lampert (2014); Balcan et al. (2015); Alquier et al. (2017); the PAC-Bayesian result of Pentina & Lampert (2014) can also be used for regularization-based parameter transfer, which we also consider. Such methods are provable variants of practical shared-representation approaches, e.g. ProtoNets Snell et al. (2017), but unlike our algorithms they do not scale to deep neural networks. Our work is especially related to Alquier et al. (2017), who also consider a dynamic, many-task notion of regret. We achieve similar bounds with a significantly more practical meta-algorithm, although within-task their results hold for any low-regret method whereas ours only hold for OCO.
Statistical LTL: While we focus on the online setting, our online-to-batch conversion results also imply generalization bounds for distributional meta-learning. The standard assumption of a distribution over tasks is due to Baxter (2000); Maurer (2005) further extended the hypothesis-space-learning framework to algorithm-learning. Recently, Amit & Meir (2018) showed PAC-Bayesian generalization bounds for this setting, although without implying an efficient algorithm. Closely related to our work are the regularization-based approaches of Denevi et al. (2018a, b), which provide statistical learning guarantees for Ridge regression with a meta-learned kernel or bias. Denevi et al. (2018b) is especially similar in spirit to our work in that it focuses on the usefulness of meta-learning compared to single-task learning, showing that their method is better than the $\ell_2$-regularized ERM baseline. In contrast to our work, neither provides algorithms that scale to more complex models or addresses the connection between loss-regularization and gradient-descent-initialization.
2 Meta-Initialization & Meta-Regularization
In this paper we study simple methods of the form shown in Algorithm 1, in which we run a within-task online algorithm on each new task and then update the initialization or regularization of this algorithm using a meta-update online algorithm. Alquier et al. (2017) study a method of this form in which the meta-update is conducted using exponentially-weighted averaging. Our use of OCO for the meta-update makes this class of algorithms much more practical; for example, in the case of OGD for both the inner and outer loop we recover the Reptile algorithm of Nichol et al. (2018).
In order to analyze this type of algorithm, we first discuss the OCO methods that make up both its inner and outer loop and the inherent connection they provide between initialization and regularization. We then make this connection explicit by formalizing the notion of learning a meta-initialization or meta-regularization as learning a parameterized Bregman regularizer. We conclude this section by proving convex-case upper and lower bounds on the task-averaged regret.
2.1 Online Convex Optimization
In the online learning setting, at each time $t$ an agent chooses action $\theta_t \in \Theta$ and suffers loss $\ell_t(\theta_t)$ for some adversarially chosen function $\ell_t$ that subsumes the loss, model, and data into one function of $\theta$. The goal is to minimize regret – the difference between the total loss and that of the optimal fixed action:
$$R_m = \sum_{t=1}^m \ell_t(\theta_t) - \min_{\theta \in \Theta} \sum_{t=1}^m \ell_t(\theta)$$
When $R_m = o(m)$, then as $m \to \infty$ the average loss of the agent will approach that of an optimal fixed action.
For OCO, $\ell_t$ is assumed convex and $G$-Lipschitz for all $t$. This setting provides many practically useful algorithms such as online gradient descent (OGD). Parameterized by a starting point $\phi = \theta_1$ and learning rate $\eta > 0$, OGD plays
$$\theta_{t+1} = \Pi_\Theta\big(\theta_t - \eta \nabla \ell_t(\theta_t)\big)$$
and achieves sublinear regret $O(DG\sqrt{m})$ when $\eta \propto \frac{D}{G\sqrt{m}}$, where $D$ is the diameter of the action space $\Theta$ and $\Pi_\Theta$ denotes projection onto it.
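A minimal sketch of projected OGD as described above, assuming for illustration a ball-shaped action space centered at the origin (the helper names are ours):

```python
import numpy as np

def project(x, radius):
    """Euclidean projection onto the ball of the given radius around the origin."""
    n = np.linalg.norm(x)
    return x if n <= radius else radius * x / n

def ogd(grads, phi, lr, radius):
    """Online gradient descent: record the play theta_t, then step
    against the observed gradient and project back onto the action space."""
    theta, plays = phi.copy(), []
    for g in grads:
        plays.append(theta.copy())
        theta = project(theta - lr * g, radius)
    return plays
```

In practice `grads` would be the gradients of the observed losses at the played points; here they are passed in directly to keep the sketch short.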
Note the similarity between OGD and the meta-initialization update in Equation 1. In fact another fundamental OCO algorithm, follow-the-regularized-leader (FTRL), is a direct analog for the meta-regularization algorithm in Equation 2, with its action at each time $t$ being the output of regularized ERM over the previous data:
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \frac{1}{2\eta}\|\theta - \phi\|_2^2 + \sum_{s \le t} \ell_s(\theta) \qquad (4)$$
Note that most definitions set $\phi = 0$. A crucial connection here is that on linear functions $\ell_t(\theta) = \langle g_t, \theta \rangle$, OGD initialized at $\phi$ plays the same actions as FTRL. Since linear losses are the hardest losses, in that low regret for them implies low regret for convex functions Zinkevich (2003), in the online setting this equivalence suggests that meta-initialization is a reasonable surrogate for meta-regularization because it solves the hardest version of the problem. The OGD-FTRL equivalence can be extended to other geometries by replacing the squared norm in (4) by a strongly-convex function $R$:
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \frac{1}{\eta} R(\theta) + \sum_{s \le t} \ell_s(\theta)$$
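The OGD-FTRL equivalence on linear losses can be checked numerically: unconstrained OGD started at $\phi$ and FTRL with the squared-distance-to-$\phi$ regularizer produce identical iterates. A self-contained sketch with synthetic gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(5, 3))       # linear losses <g_t, theta>
phi, lr = np.ones(3), 0.1

# OGD from phi: theta_{t+1} = theta_t - lr * g_t (unconstrained).
theta, ogd_plays = phi.copy(), [phi.copy()]
for g in grads:
    theta = theta - lr * g
    ogd_plays.append(theta.copy())

# FTRL with regularizer ||theta - phi||^2 / (2 * lr) on the same linear losses
# has the closed form theta_t = phi - lr * sum_{s < t} g_s.
ftrl_plays = [phi - lr * grads[:t].sum(axis=0) for t in range(len(grads) + 1)]
```

Both lists of iterates coincide step for step, which is exactly the equivalence used in the text.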
In the case of linear losses this is the online mirror descent (OMD) generalization of OGD. For $G$-Lipschitz losses, OMD and FTRL have the following well-known regret guarantee (Shalev-Shwartz, 2011, Theorem 2.11):
$$R_m \le \frac{\mathcal{B}_R(\theta^* \| \phi)}{\eta} + \eta G^2 m \qquad (5)$$
where $\theta^*$ is the optimal action in hindsight and $\mathcal{B}_R$ is the Bregman divergence of $R$, defined in Section 2.3.
2.2 Task-Averaged Regret and Task Similarity
In this paper we consider the lifelong extension of the online learning setting, where $t = 1, \dots, T$ now indexes a sequence of online learning problems, in each of which the agent must sequentially choose actions $\theta_{t,i}$ and suffer losses $\ell_{t,i}(\theta_{t,i})$ for $i = 1, \dots, m$. Since in meta-learning we are interested in doing well on individual tasks, we will aim to minimize a dynamic notion of regret in which the comparator changes with each task.
The task-averaged regret (TAR) of an online algorithm after $T$ tasks with $m$ steps each is
$$\bar{R} = \frac{1}{T} \sum_{t=1}^T \left( \sum_{i=1}^m \ell_{t,i}(\theta_{t,i}) - \min_{\theta \in \Theta} \sum_{i=1}^m \ell_{t,i}(\theta) \right)$$
As the comparator in this regret is dynamic, without very strong assumptions one cannot hope to achieve TAR decreasing in $T$. A seeming remedy for this issue in our parameter-transfer setting is to subtract from TAR a “meta-comparator” that uses the optimal meta-initialization or meta-regularization in hindsight but with the same within-task algorithm. However, to prove sublinear regret using this approach, one has to use low-regret algorithms with tight constants on their upper and lower bounds, as otherwise the agent will always suffer a constant-factor-worse loss on each task. Such tight bounds are known for very few algorithms Abernethy et al. (2008). Our study of TAR is thus motivated by an interest in understanding average-case regret, as well as by our derivation of an online-to-batch conversion for generalization bounds on distributional LTL. Note also that TAR is similar to the compound regret studied by Alquier et al. (2017), although they also compete with the best representation in hindsight.
We now formalize our similarity assumption on the tasks: their optimal actions lie within a small subset $\Theta^*$ of the action space $\Theta$. This is natural for studying gradient-based meta-learning, as the notion that there exists a meta-parameter from which a good parameter for any individual task is reachable with only a few steps implies that they are all close together. We develop algorithms whose TAR scales with the diameter $D^*$ of $\Theta^*$; notably, this means they will not do much worse if $D^* = D$, i.e. if the tasks are not related in this way, but will do well if $D^* \ll D$. Importantly, our methods will not require knowledge of $\Theta^*$.
Assume each task $t \in [T]$ consists of $m$ convex $G$-Lipschitz loss functions and let $\theta^*_t$ be the minimum-norm optimal action in hindsight for task $t$. Define $\Theta^* \subseteq \Theta$ to be the minimal subset containing all $\theta^*_t$.
Note $\theta^*_t$ is unique as the minimizer of the squared norm, a strongly convex function, over the minima of a convex function. The algorithms in Section 2.4 assume an efficient oracle computing $\theta^*_t$.
2.3 Parameterizing Bregman Regularizers
Following the main idea of gradient-based meta-learning, our goal is to learn a $\phi$ such that an online algorithm such as OGD starting from $\phi$ will have low regret. We thus treat regret as our objective and observe that in the regret of FTRL (5), the regularizer effectively encodes a distance from the initialization to $\theta^*$. This is clear in the Euclidean geometry of the squared $\ell_2$-norm, but can be extended via the Bregman divergence Bregman (1967), defined for everywhere-sub-differentiable and convex $f$ as
$$\mathcal{B}_f(x \| y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle$$
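The divergence $\mathcal{B}_f(x\|y) = f(x) - f(y) - \langle \nabla f(y), x - y\rangle$ is straightforward to compute. A small sketch specializing to the squared-norm case, where it reduces to half the squared Euclidean distance (helper names are ours):

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence B_f(x || y) = f(x) - f(y) - <grad_f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

sq = lambda v: 0.5 * float(v @ v)     # R(v) = ||v||^2 / 2
grad_sq = lambda v: v

x, y = np.array([1.0, 2.0]), np.array([0.0, 0.0])
d = bregman(sq, grad_sq, x, y)        # equals 0.5 * ||x - y||^2 in this geometry
```

Swapping in the negative entropy for `sq` would instead yield the KL-divergence used for learning over the probability simplex.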
The Bregman divergence has many useful properties Banerjee et al. (2005) that allow us to use it almost directly as a parameterized regularization function. However, in order to use OCO for the meta-update we also require it to be strictly convex in the second argument, a property that holds for the Bregman divergence of both the squared $\ell_2$-norm regularizer and the entropic regularizer used for online learning over the probability simplex, e.g. with expert advice.
Let $R$ be 1-strongly-convex w.r.t. some norm $\|\cdot\|$ on convex $\Theta$. Then we call its Bregman divergence $\mathcal{B}_R(\cdot\|\cdot)$ a Bregman regularizer if $\mathcal{B}_R(\theta\|\cdot)$ is strictly convex for any fixed $\theta$.
Within each task, the regularizer is parameterized by the second argument and acts on the first. More specifically, for $R(\theta) = \frac{1}{2}\|\theta\|_2^2$ we have $\mathcal{B}_R(\theta\|\phi) = \frac{1}{2}\|\theta - \phi\|_2^2$, and so in the case of FTRL and OGD, $\phi$ is a parameterization of the regularization and the initialization, respectively. In the case of the entropic regularizer, the associated Bregman regularizer is the KL-divergence from $\phi$ to $\theta$, and thus meta-learning can very explicitly be seen as learning a prior.
Finally, we use Bregman regularizers to formally define our parameterized learning algorithms:
$\mathrm{FTRL}_{\phi,\eta}$, for $\eta > 0$, where $\phi$ lies in some bounded convex subset $\hat\Theta \supseteq \Theta^*$, plays
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \frac{1}{\eta} \mathcal{B}_R(\theta \| \phi) + \sum_{s \le t} \ell_s(\theta)$$
for Bregman regularizer $\mathcal{B}_R$. Similarly, $\mathrm{OMD}_{\phi,\eta}$ plays
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \eta \Big\langle \sum_{s \le t} \nabla \ell_s(\theta_s), \theta \Big\rangle + \mathcal{B}_R(\theta \| \phi)$$
We now specify the first variant of our main algorithm, Follow-the-Meta-Regularized-Leader (Ephemeral), in the case where the diameter of $\Theta^*$, as measured by the square root of the maximum Bregman divergence between any two points, is known. Starting with an arbitrary $\phi_1$, run $\mathrm{FTRL}_{\phi_t,\eta}$ or $\mathrm{OMD}_{\phi_t,\eta}$, with $\eta$ set using the diameter, on the losses in each task $t$. After each task, compute $\phi_{t+1}$ using an OCO meta-update algorithm operating on the Bregman divergences $\mathcal{B}_R(\theta^*_s \| \cdot)$. For unknown diameter, make an underestimate and multiply it by a constant factor each time an observed divergence exceeds the current guess.
The following is a regret bound for this algorithm when the meta-update is either Follow-the-Leader (FTL), which plays the minimizer of all past losses, or OGD with adaptive step size. We call this Ephemeral variant Follow-the-Average-Leader (FAL) because in the case of FTL the algorithm uses the mean of the previous optimal parameters in hindsight as the initialization. Pseudo-code for this and other variants is given in Algorithm 2. For brevity, we state simplified results here; detailed statements are in the supplement.
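In the Euclidean case the FAL meta-update is particularly simple: FTL over the squared-distance losses keeps the initialization at the running mean of the optimal actions seen so far, and the unknown similarity is handled by a doubling guess. A hypothetical sketch (function names and the doubling factor are our own choices):

```python
import numpy as np

def fal_initializations(optima):
    """FTL meta-update in the Euclidean case: after task t, the next
    initialization is the mean of the optimal actions seen so far
    (the minimizer of the summed squared distances)."""
    phis, running = [], np.zeros_like(optima[0])
    for t, theta_star in enumerate(optima, start=1):
        running = running + theta_star
        phis.append(running / t)
    return phis

def doubled_guess(eps, observed, factor=2.0):
    """Doubling trick for the unknown task similarity: grow the
    underestimate whenever an observed distance exceeds it."""
    while observed > eps:
        eps *= factor
    return eps
```

In the full algorithm each `phis[t]` would seed FTRL or OMD on task `t + 1`, and the learning rate would be set from the current similarity guess.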
We give a proof for the Euclidean case and known task similarity, i.e. the guess equals $D^*$. A full proof is in the supplement. Use $\mathcal{B}_t(\cdot) = \frac{1}{2}\|\theta^*_t - \cdot\|_2^2$ to denote the divergence to $\theta^*_t$ and let $\bar\theta = \frac{1}{T}\sum_t \theta^*_t$. Note each $\mathcal{B}_t$ is strongly convex and $\bar\theta$ is the minimizer of their sum, with average divergence to the optimal actions bounded by the task similarity. We can then expand Definition 2.1 for task-averaged regret:
The first two lines just substitute the regret bound (5) of FTRL and OMD. The key step is the last one, where the regret is split into the left-hand loss of the meta-update algorithm and the right-hand loss incurred if we had always initialized at the mean $\bar\theta$ of the optimal actions. Since $\{\mathcal{B}_t\}_t$ is a sequence of strongly-convex functions with minimizer $\bar\theta$, and since each $\phi_t$ is determined by playing FTL or OGD on these same functions, the left-hand term is exactly the regret of these algorithms on strongly-convex functions, which is known to be logarithmic Bartlett et al. (2008); Kakade & Shalev-Shwartz (2008). Substituting $\eta$ and the definition of $D^*$ sets the right-hand term to the claimed bound.
The full proof uses the doubling trick on the unknown task similarity $D^*$, which requires an analysis of the location of the meta-parameter $\phi_t$ to ensure that we only increase the guess when needed. The extension to non-Euclidean geometries uses a novel logarithmic regret bound for FTL over Bregman-regularizer losses.
Theorem 2.1 shows that, so long as the similarity guess is not too large, the task-averaged regret of Ephemeral will scale with the task similarity $D^*$. The averaged-divergence component shows that this bound improves if the $\theta^*_t$ are close on average; in the Euclidean case this is the average squared distance to the mean optimal action. Furthermore, if $D^* = D$, i.e. if the tasks are not similar, then the algorithm will only do a constant factor worse than FTRL or OMD; this is similar to other “optimistic” methods that work well under regularity Rakhlin & Sridharan (2013); Jadbabaie et al. (2015). These results show that gradient-based meta-learning is useful in convex settings: under a simple notion of task similarity, using multiple tasks leads to better performance than the regret of running the same algorithm in a single-task setting. Furthermore, the algorithm scales well in terms of computation and memory requirements, and when OGD is used for both the within-task and meta-update steps it is very similar to Reptile Nichol et al. (2018).
However, it is easy to see that an even simpler “strawman” algorithm achieves regret only a constant factor worse than Ephemeral: at task $t+1$, simply initialize FTRL or OMD using the optimal parameter of task $t$. Of course, since such algorithms are often used in the few-shot setting of small $m$, a reduction in the average regret is practically significant; we observe this empirically in Figure 3. Indeed, in the proof of Theorem 2.1 the regret converges to the regret bound obtained by always playing the mean of the optimal actions if we somehow knew it beforehand, which will not occur when playing the strawman algorithm. Furthermore, the following lower bound on the task-averaged regret, a multi-task extension of Abernethy et al. (2008, Theorem 4.2), shows that such constant-factor reductions are the best we can achieve under our task similarity assumption:
Assume that for each task $t$ an adversary must play a sequence of $m$ convex $G$-Lipschitz functions whose optimal actions in hindsight are contained in some fixed $\ell_2$-ball with center $\bar\theta$ and diameter $D^*$. Then the adversary can force the agent to have TAR at least $\Omega(D^* G \sqrt{m})$.
More broadly, this lower bound shows that the learning-theoretic benefits of gradient-based meta-learning are inherently limited without stronger assumptions on the tasks. Nevertheless, Ephemeral-style algorithms are very attractive from a practical perspective, as their memory and computation requirements per iteration scale linearly in the dimension and not at all in the number of tasks.
3 Provable Guarantees for Practical Gradient-Based Meta-Learning
In the previous section we showed that an algorithm with access to the best actions in hindsight of each task could learn a good meta-initialization or meta-regularization. In practice we may wish to be more computationally efficient and use a simpler-to-compute quantity for the meta-update. In addition, in the i.i.d. case few-shot ERM may not be a good task representation and a task similarity assumption on the true risk minimizers may be more relevant. In this section we first show how two simple variants of Ephemeral handle these settings. Finally, we also provide an online-to-batch conversion result for task-averaged regret that implies good generalization guarantees when any of the variants of Ephemeral are run in the distributional LTL setting.
3.1 Simple-to-Compute Meta-Updates
The FAL variant of Ephemeral uses each task’s minimum-norm optimal action in hindsight $\theta^*_t$ to perform a meta-update. While $\theta^*_t$ is efficiently computable in some cases, in most it is more efficient and practical to use an estimate instead. This is especially true when applying these methods in the deep learning setting; for example, Nichol et al. (2018) find that taking the average within-task gradient works well. Furthermore, in the batch setting, when each task consists of i.i.d. samples drawn from an adversarially chosen distribution, a more natural notion of task similarity would depend on the true risk minimizer of each task, of which $\theta^*_t$ is just an estimate. We thus extend the results of Section 2.4 to handle these considerations by proving regret bounds for two variants of Ephemeral: one for the adversarial setting, which uses the final action on task $t$ as the meta-update, and one for the stochastic setting, which uses the average iterate. We call these methods FLI-Online and FLI-Batch, respectively, where FLI stands for Follow-the-Last-Iterate.
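Both FLI variants reuse the within-task iterates and differ only in whether the last or the average iterate feeds the meta-update. A sketch on a toy quadratic task (our own toy loss, not the paper's experiments):

```python
import numpy as np

def within_task_sgd(grad_fn, phi, lr, m):
    """Run m within-task gradient steps from initialization phi and return
    the two estimates used for the meta-update: the last iterate
    (FLI-Online style) and the average iterate (FLI-Batch style)."""
    theta, iterates = phi.copy(), []
    for _ in range(m):
        theta = theta - lr * grad_fn(theta)
        iterates.append(theta.copy())
    return iterates[-1], np.mean(iterates, axis=0)

# Hypothetical task: quadratic loss centered at `target`, so grad = theta - target.
target = np.array([2.0])
last, avg = within_task_sgd(lambda th: th - target, np.zeros(1), lr=0.5, m=3)
```

With more steps both estimates approach the task optimum; the average iterate is less sensitive to stochastic noise, which is why it suits the batch-within-online setting.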
However, to achieve these guarantees we need to make some assumptions on the within-task loss functions. This is unavoidable because we need estimates of the optimal actions of different tasks to be nearby; in general, a convex function can have a small loss gap $\ell(\theta) - \ell(\theta^*)$ but a large distance $\|\theta - \theta^*\|$ if $\ell$ does not increase quickly away from the minimum. This makes it impossible to use guarantees on the loss of an estimate of $\theta^*$ to bound its distance from $\theta^*$. We therefore assume that some aggregate loss, e.g. the expectation or sum of the within-task losses, satisfies the following growth condition:
A function $f$ has $\alpha$-quadratic-growth ($\alpha$-QG) w.r.t. $\|\cdot\|$ for $\alpha > 0$ if for any $\theta$ and its closest minimum $\theta^*$ of $f$ we have
$$f(\theta) - f(\theta^*) \ge \frac{\alpha}{2}\|\theta - \theta^*\|^2$$
QG has recently been used to provide fast rates for both offline and online GD that hold for practical problems such as LASSO and logistic regression under data-dependent assumptions Karimi et al. (2016); Garber (2019). It can be shown to hold for compositions of strongly-convex functions with linear maps; in this case the growth constant can be derived following Karimi et al. (2016). Note that $\alpha$-QG will also be satisfied when the function itself is $\alpha$-strongly-convex, making the former a weaker condition.
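The growth condition is easy to check numerically on sample points, which is essentially how we verify it on real data in Figure 2. A small sketch (the helper name and tolerance are our own choices):

```python
import numpy as np

def satisfies_qg(f, minimizer, alpha, points):
    """Numerically check the alpha-quadratic-growth condition
    f(x) - f(x*) >= (alpha / 2) * ||x - x*||^2 at the given sample points."""
    fmin = f(minimizer)
    return all(
        f(x) - fmin >= 0.5 * alpha * float(np.sum((x - minimizer) ** 2)) - 1e-12
        for x in points
    )
```

For example, $f(x) = \|x\|_2^2$ is $2$-QG (with equality) but not $3$-QG, and the check below reflects exactly that.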
We start with FLI-Online; as shown in Algorithm 2, this variant is the same as FAL except that the meta-update is performed using the last action of FTRL, i.e. the regularized empirical risk minimizer. To provide regret guarantees in this setting, we stipulate that the average within-task loss is $\alpha$-QG and strengthen the task similarity notion slightly:
Let each task $t$ consist of $m$ convex $G$-Lipschitz loss functions s.t. the total loss $\sum_{i=1}^m \ell_{t,i}$ is $\alpha m$-QG w.r.t. $\|\cdot\|_2$. Define $\Theta^* \subseteq \Theta$ s.t. it contains every optimal action of every task.
In contrast to Assumption 2.1, we require $\Theta^*$ to contain all optimal actions and not only the one with minimal norm. Furthermore, we require that the growth factor scales linearly in $m$. While this is a stronger requirement than usually assumed, in Figure 2 we show that it holds in certain real and synthetic settings. Note that this growth factor will always hold in the case of the losses themselves being $\alpha$-strongly-convex.
Under such data-dependent assumptions, and if the within-task algorithm is FTRL, we have the following bound:
Note that the above regret bound is very similar to that in Theorem 2.1, apart from a per-task error term decreasing in $m$ that is due to the use of an estimate of $\theta^*_t$.
We now turn to the FLI-Batch algorithm, which uses each task’s average action for the meta-update. To give a regret bound here, we assume that at each task $t$ an adversary picks a distribution $\mathcal{P}_t$ over loss functions, from which $m$ samples are drawn i.i.d. This follows the batch-within-online setting of Alquier et al. (2017). We thus use the distance between the true-risk minimizers for the task similarity assumption:
Let each task $t$ consist of $m$ convex $G$-Lipschitz loss functions sampled i.i.d. from distribution $\mathcal{P}_t$ s.t. the expected total loss is $\alpha m$-QG w.r.t. $\|\cdot\|_2$. Define $\Theta^* \subseteq \Theta$ s.t. it contains the true-risk minimizer of every task.
We can show a high probability bound on the task-averaged regret assuming strongly-smooth regularization:
3.2 Distributional Learning-to-Learn
While gradient-based LTL algorithms are largely online, their goals are often statistical. Here we review distributional LTL and prove an online-to-batch conversion showing that low TAR implies low risk for within-task learning.
As formulated by Baxter (2000), distributional LTL assumes a distribution $\mathcal{Q}$ over task distributions $\mathcal{P}$. Given $m$ i.i.d. data samples from each of $T$ i.i.d. task samples $\mathcal{P}_t \sim \mathcal{Q}$, we seek to do well in expectation when new samples are drawn from a new distribution $\mathcal{P} \sim \mathcal{Q}$ and we must learn to predict from them. This models a setting with not enough data to learn each task on its own, i.e. $m$ is small, but where tasks are somehow related and thus samples from many tasks can reduce the number of samples needed from a new one. Parameter-transfer LTL lies within the algorithm-learning framework of Maurer (2005), where task samples are used to learn a learning algorithm parameterized by $\phi$ that takes $m$ data points and returns a prediction algorithm parameterized by $\theta$.
Theorem 3.3 bounds the within-task expected risk under task-averaged regret guarantees for any task sample from the same distribution $\mathcal{Q}$. For Ephemeral, the procedure picks a task $t$ uniformly at random, runs $\mathrm{FTRL}_{\phi_t,\eta}$ or $\mathrm{OMD}_{\phi_t,\eta}$ on samples from the new task, and outputs the average iterate as the learned parameter. Note that guarantees on randomly sampled or mean iterates are standard, although in practice we use the final meta-parameter and the last iterate as the learned parameters.
Suppose convex losses are generated by sampling task distributions i.i.d. from some distribution $\mathcal{Q}$ and drawing $m$ i.i.d. losses from each. Let $\phi_t$ be the state, before some task $t$ picked uniformly at random, of an algorithm with task-averaged regret $\bar{R}$. Then if $m$ new loss functions are sampled from a new task distribution $\mathcal{P} \sim \mathcal{Q}$, running the algorithm on these losses will generate iterates s.t. with high probability their mean satisfies the stated risk bound, where the outer expectation is over the sampling of the task and the data.
The result follows by nesting two different single-task online-to-batch conversions Cesa-Bianchi et al. (2004); Cesa-Bianchi & Gentile (2005) via Jensen’s inequality. Note that the first term is the expected empirical risk of the ERM, which is small in many practical settings, such as for linear models over non-atomic distributions. Thus, apart from a fast-decaying term, proving a low task-averaged regret directly improves the bound on the risk of a new task sampled from $\mathcal{Q}$.
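The Jensen step at the heart of the conversion can be illustrated directly: for a convex risk, the risk of the mean iterate never exceeds the mean risk of the iterates, so low average online loss implies low risk for the averaged output. A self-contained sketch with toy values of our own:

```python
import numpy as np

# Convex within-task risk: mean over data points z of 0.5 * (theta - z)^2.
zs = np.array([0.0, 1.0, 2.0])
risk = lambda theta: float(np.mean(0.5 * (theta - zs) ** 2))

# Iterates produced by some within-task online algorithm (hypothetical values).
iterates = np.array([0.0, 0.5, 1.5, 1.0])
theta_bar = iterates.mean()          # online-to-batch conversion: the mean iterate

# Jensen's inequality: risk(mean iterate) <= mean of the iterates' risks.
lhs = risk(theta_bar)
rhs = float(np.mean([risk(th) for th in iterates]))
```

The gap between `lhs` and `rhs` is exactly what the conversion gives away, and it vanishes as the iterates concentrate.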
4 Empirical Results
A major benefit of Ephemeral is its practicality. In particular, the FLI-Batch variant is scalable without modification to high-dimensional, non-convex models. Here its practical effectiveness is evidenced by the success of first-order MAML and similar algorithms, as our method is a generalization of Reptile Nichol et al. (2018), which performs slightly worse than MAML on the Omniglot benchmark Lake et al. (2017) but better on the harder Mini-ImageNet benchmark Ravi & Larochelle (2017). With this evidence, our main empirical goal is to validate our theory in the convex setting, although we also examine implications for deep meta-learning.
4.1 Convex Setting
We introduce a new dataset of 812 classification tasks, each consisting of sentences from one of four Wikipedia pages, which we use as labels; we call this dataset Mini-Wikipedia. Our use of text classification to examine the convex setting is motivated by the well-known effectiveness of linear models over simple representations Wang & Manning (2012); Arora et al. (2018). We use logistic regression over continuous-bag-of-words (CBOW) vectors built using 50-dimensional GloVe embeddings Pennington et al. (2014). The similarity of these tasks is verified by checking whether their optimal parameters are close together. As shown in Figure 1, we find when $\Theta$ is the unit ball that even in the 1-shot setting the tasks have non-vacuous similarity; for 32 shots the parameters are contained in a set of radius 0.32.
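The kind of similarity check described here can be sketched as measuring how far per-task optimal parameters spread around their mean (illustrative values below, not the Mini-Wikipedia measurements):

```python
import numpy as np

def similarity_radius(task_params):
    """Rough task-similarity estimate: the largest distance from any task's
    optimal parameter vector to the mean of all of them."""
    params = np.asarray(task_params, dtype=float)
    center = params.mean(axis=0)
    return float(np.max(np.linalg.norm(params - center, axis=1)))
```

In the paper's terms, a radius much smaller than the diameter of the unit-ball action space is what makes meta-learning provably helpful on these tasks.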
We next compare Ephemeral to the “strawman” algorithm from Section 2, which simply uses the previous optimal action as the initialization. For both algorithms we use the same task-similarity guess and tuning parameters. As expected, we see in Figure 3 that the improvement of Ephemeral over the strawman is especially prominent for few-shot learning, showing that the theoretical task-similarity-based improvement we achieve is practically significant when individual tasks have very few samples. We also see that FLI-Batch, which uses an estimate of the best parameter for the meta-update, approaches the performance of FAL as the number of samples increases and thus its estimate improves.
Finally, we evaluate the performance of Ephemeral and (first-order) MAML in the distributional setting on this NLP task. On each task we standardize data using the mean and deviation of the training features. For Ephemeral we use the FAL variant with OGD as the within-task algorithm, with learning rate set using the average deviation of the optimal task parameters from the mean optimal parameter, as suggested in Remark 2.1. For MAML, we use a hyperparameter sweep to determine the within-task and meta-update learning rates; for our algorithm, we simply use the root average squared distance of all tasks in hindsight, which from Theorem 2.1 can be seen to minimize the upper bound on the within-task regret. As shown in Figure 4, even though Ephemeral does not require any learning-rate tuning, unlike the MAML procedure, it performs comparably – slightly better in some settings and slightly worse in others.
4.2 Deep Learning
While our algorithm generalizes Reptile, an already-effective gradient-based meta-learning algorithm Nichol et al. (2018), we can still see whether improvements suggested by our theory help for neural-network LTL. To this end we study controlled modifications of the settings used in the Reptile experiments on 5-way and 20-way Omniglot Lake et al. (2017) and 5-way Mini-ImageNet classification Ravi & Larochelle (2017). For both datasets, we use the same convolutional neural networks as Nichol et al. (2018), which were themselves taken from the MAML experiments. As in this prior work, our evaluations are conducted in the transductive setting, in which test points are evaluated in batch, enabling sharing of batch-normalization statistics.
Our theoretical results point to the importance of accurately computing the within-task parameter before the meta-update; Theorem 2.1 assumes access to the optimal parameter in hindsight, whereas Theorems 3.1 and 3.2 allow computational and stochastic approximations that result in an additional error term decaying with $m$, the number of within-task examples. This becomes relevant in the non-convex setting with many thousands of tasks, where it can be infeasible to find even a local optimum.
The theory thus suggests that using a better estimate of the within-task parameter for the meta-update may lead to lower regret, and thus lower generalization error. We can attain a better estimate by using more samples on each task, to reduce stochastic noise, or by running more gradient steps on each task, to reduce approximation error. It is not obvious that these changes will improve performance – it may be better to learn a few-shot learning algorithm using the same settings at meta-train and meta-test time. However, in practice the Reptile authors use more task samples – 10 for Omniglot and 15 for Mini-ImageNet – at meta-train time than the number of shots – at most 5 – used for evaluation. On the other hand, they use far fewer within-task gradient steps – 5 for Omniglot and 8 for Mini-ImageNet – at meta-train time than the 50 iterations used for evaluation.
We study how varying these two settings – the number of task samples and the number of within-task iterations – changes performance at meta-test time. In Figure 5, we see that more within-task samples provide a significant improvement in performance for Mini-ImageNet and Omniglot, with many fewer meta-iterations needed to reach good test performance. Reducing the number of meta-iterations is important in practice as it corresponds to fewer tasks needed for training, although for a better stochastic approximation each task needs more samples. On the other hand, increasing the number of training iterations does not require additional samples, and we see in Figure 6 that raising this value can also lead to better performance, especially on 20-way Omniglot, although the effect is less clear for Mini-ImageNet, where using more than 8 training iterations reduces performance. The latter result is likely due to over-fitting on specific tasks, with task similarity in this stochastic setting likely holding for the true rather than empirical risk minimizers, as in Assumption 3.2. The broad patterns shown above also hold for several other parameter settings, which we depict in greater detail in the supplement.
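The effect of these two knobs can be sketched in a toy setting (our own construction with hypothetical names, not the authors' code): a Reptile-style meta-step whose inner loop exposes both the number of within-task gradient steps and the per-step batch size, the two quantities varied in the experiments above.

```python
import numpy as np

def reptile_meta_step(meta_params, sample_batch, inner_steps, inner_lr, meta_lr):
    """One Reptile-style meta-update (toy sketch): run SGD within the task,
    then move the meta-initialization toward the resulting task parameters.
    More inner steps reduce approximation error; larger batches from
    sample_batch reduce stochastic noise in the within-task estimate."""
    params = meta_params.copy()
    for _ in range(inner_steps):
        X, y = sample_batch()                        # fresh task samples
        grad = 2 * X.T @ (X @ params - y) / len(y)   # least-squares gradient
        params -= inner_lr * grad
    # Meta-update: interpolate toward the (approximate) within-task optimum.
    return meta_params + meta_lr * (params - meta_params)
```

Simulating tasks whose optima cluster around a common point, the initialization drifts toward the cluster center, faster when the inner loop estimates the task parameters more accurately.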
In this paper we undertook a study of a broad class of gradient-based meta-learning methods using the theory of online convex optimization. Our results show the usefulness of running such methods compared to single-task learning under the assumption that individual task parameters are close together. The fully online guarantees of our meta-algorithm, Ephemeral, can be extended to practically relevant approximate meta-updates, the batch-within-online setting, and distributional LTL.
Apart from these results, the simplicity of Ephemeral makes it extensible to various settings of practical interest in meta-learning, such as for federated learning and differential privacy. In the theoretical direction, future work can consider more sophisticated notions of task-similarity, such as for multi-modal or continuously-evolving settings. While we have studied only the parameter-transfer setting, deriving statistical or low-regret guarantees for practical and scalable representation-learning remains an important research goal.
This work was supported in part by DARPA FA875017C0141, National Science Foundation grants CCF-1535967, IIS-1618714, IIS-1705121, and IIS-1838017, a Microsoft Research Faculty Fellowship, an Okawa Grant, a Google Faculty Award, an Amazon Research Award, an Amazon Web Services Award, and a Carnegie Bosch Institute Research Award. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.
- Abernethy et al. (2007) Abernethy, J., Bartlett, P., and Rakhlin, A. Multitask learning with expert advice. In Proceedings of the International Conference on Computational Learning Theory, 2007.
- Abernethy et al. (2008) Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.
- Al-Shedivat et al. (2018) Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. In Proceedings of the 6th International Conference on Learning Representations, 2018.
- Alquier et al. (2017) Alquier, P., Mai, T. T., and Pontil, M. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
- Amit & Meir (2018) Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Arora et al. (2018) Arora, S., Khodak, M., Saunshi, N., and Vodrahalli, K. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTMs. In Proceedings of the 6th International Conference on Learning Representations, 2018.
- Azuma (1967) Azuma, K. Weighted sums of certain dependent random variables. Tôhoku Mathematical Journal, 19:357–367, 1967.
- Balcan et al. (2015) Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Proceedings of the Conference on Learning Theory, 2015.
- Banerjee et al. (2005) Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
- Bartlett et al. (2008) Bartlett, P. L., Hazan, E., and Rakhlin, A. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, 2008.
- Baxter (2000) Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
- Bregman (1967) Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.
- Cavallanti et al. (2010) Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934, 2010.
- Cesa-Bianchi & Gentile (2005) Cesa-Bianchi, N. and Gentile, C. Improved risk tail bounds for on-line algorithms. In Advances in Neural Information Processing Systems, 2005.
- Cesa-Bianchi et al. (2004) Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
- Chen et al. (2018) Chen, F., Dong, Z., Li, Z., and He, X. Federated meta-learning for recommendation. arXiv, 2018.
- Dekel et al. (2007) Dekel, O., Long, P. M., and Singer, Y. Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8:2233–2264, 2007.
- Denevi et al. (2018a) Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental learning-to-learn with statistical guarantees. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2018a.
- Denevi et al. (2018b) Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, 2018b.
- Duchi et al. (2010) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Proceedings of the Conference on Learning Theory, 2010.
- Evgeniou & Pontil (2004) Evgeniou, T. and Pontil, M. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
- Fellbaum (1998) Fellbaum, C. WordNet: An Electronic Lexical Database. MIT Press, 1998.
- Finn & Levine (2018) Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In Proceedings of the 6th International Conference on Learning Representations, 2018.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- Franceschi et al. (2018) Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Frank & Wolfe (1956) Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3, 1956.
- Garber (2019) Garber, D. Fast rates for online gradient descent without strong convexity via Hoffman’s bound. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
- Grant et al. (2018) Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. In Proceedings of the 6th International Conference on Learning Representations, 2018.
- Hazan (2015) Hazan, E. Introduction to online convex optimization. In Foundations and Trends in Optimization, volume 2, pp. 157–325. now Publishers Inc., 2015.
- Jadbabaie et al. (2015) Jadbabaie, A., Rakhlin, A., and Shahrampour, S. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
- Jerfel et al. (2018) Jerfel, G., Grant, E., Griffiths, T. L., and Heller, K. Online gradient-based mixtures for transfer modulation in meta-learning. arXiv, 2018.
- Kakade & Shalev-Shwartz (2008) Kakade, S. and Shalev-Shwartz, S. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, 2008.
- Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.
- Kim et al. (2018) Kim, J., Lee, S., Kim, S., Cha, M., Lee, J. K., Choi, Y., Choi, Y., Choi, D.-Y., and Kim, J. Auto-Meta: Automated gradient based meta learner search. arXiv, 2018.
- Kuzborskij & Orabona (2013) Kuzborskij, I. and Orabona, F. Stability and hypothesis transfer learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.
- Lake et al. (2017) Lake, B. M., Salakhutdinov, R., Gross, J., and Tenenbaum, J. B. One shot learning of simple visual concepts. In Proceedings of the Conference of the Cognitive Science Society (CogSci), 2017.
- Li et al. (2017) Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv, 2017.
- Maurer (2005) Maurer, A. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994, 2005.
- Nichol et al. (2018) Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv, 2018.
- Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014.
- Pentina & Lampert (2014) Pentina, A. and Lampert, C. H. A PAC-Bayesian bound for lifelong learning. In Proceedings of the 31st International Conference on Machine Learning, 2014.
- Polyak (1963) Polyak, B. T. Gradient methods for minimizing functionals. USSR Computational Mathematics and Mathematical Physics, 3(3):864–878, 1963.
- Rakhlin & Sridharan (2013) Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Proceedings of the Conference on Learning Theory, 2013.
- Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.
- Ruvolo & Eaton (2013) Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning, 2013.
- Shalev-Shwartz (2011) Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
- Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. S. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
- Thrun & Pratt (1998) Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 1998.
- Wang & Manning (2012) Wang, S. and Manning, C. D. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.
- Zinkevich (2003) Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.
Appendix A Background and Results for Online Convex Optimization
Throughout the appendix we assume all subsets are convex and lie in $\mathbb{R}^d$ unless explicitly stated. Let $\|\cdot\|_*$ be the dual norm of $\|\cdot\|$, which we assume to be any norm on $\mathbb{R}^d$, and note that the dual norm of $\|\cdot\|_2$ is itself. For sequences of scalars $\sigma_1, \sigma_2, \dots$ we will use the notation $\sigma_{1:t}$ to refer to the sum of the first $t$ of them. In the online learning setting, we will use the shorthand $\nabla_t$ to denote the subgradient of $\ell_t$ evaluated at action $\theta_t$. We will use $\operatorname{conv}(S)$ to refer to the convex hull of a set of points $S$.
a.1 Convex Functions
We first state the related definitions of strong convexity and strong smoothness:
An everywhere sub-differentiable function $f$ is $\alpha$-strongly-convex w.r.t. norm $\|\cdot\|$ if
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2 \quad \forall\, x, y.$$
An everywhere sub-differentiable function $f$ is $\beta$-strongly-smooth w.r.t. norm $\|\cdot\|$ if
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2} \|y - x\|^2 \quad \forall\, x, y.$$
Let $f$ be an everywhere sub-differentiable strictly convex function. Its Bregman divergence is defined as
$$\mathcal{B}_f(x \| y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle.$$
The definition directly implies that $\mathcal{B}_f(\cdot \| y)$ preserves the (strong or strict) convexity of $f$ for any fixed $y$. Strict convexity further implies $\mathcal{B}_f(x \| y) \ge 0$, with equality iff $x = y$. Finally, if $f$ is $\alpha$-strongly-convex, or $\beta$-strongly-smooth, w.r.t. $\|\cdot\|$, then Definition A.1 implies $\mathcal{B}_f(x \| y) \ge \frac{\alpha}{2} \|x - y\|^2$, or $\mathcal{B}_f(x \| y) \le \frac{\beta}{2} \|x - y\|^2$, respectively.
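As a small numerical illustration (our own sketch, not part of the paper): for the squared Euclidean regularizer $f = \frac{1}{2}\|\cdot\|_2^2$, which is both 1-strongly-convex and 1-strongly-smooth w.r.t. $\|\cdot\|_2$, the two bounds above coincide and the Bregman divergence is exactly $\frac{1}{2}\|x - y\|_2^2$.

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence B_f(x || y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

f = lambda v: 0.5 * v @ v   # 1-strongly-convex and 1-strongly-smooth w.r.t. l2
grad_f = lambda v: v
x, y = np.array([1.0, 2.0]), np.array([0.0, -1.0])
div = bregman(f, grad_f, x, y)   # equals 0.5 * ||x - y||^2 = 5.0
```

This also makes concrete why the divergence is nonnegative and vanishes only at $x = y$.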
Let $f$ be a strictly convex function on $\Theta$, $\{\theta_t\}_{t \in [T]} \subset \Theta$ be a sequence of points, $\{\sigma_t\}_{t \in [T]}$ be positive scalars, and $\bar\theta = \frac{1}{\sigma_{1:T}} \sum_{t=1}^T \sigma_t \theta_t$. Then
$$\bar\theta = \arg\min_{\theta \in \Theta} \sum_{t=1}^T \sigma_t \mathcal{B}_f(\theta_t \| \theta).$$
By Definition A.3 the last expression has a unique minimum at $\bar\theta$. ∎
a.2 Online Algorithms
Here we provide a review of the online algorithms we use within each task. Our focus is on two closely related meta-algorithms, Follow-the-Regularized-Leader (FTRL) and (linearized lazy) Online Mirror Descent (OMD). For a given Bregman regularizer $\mathcal{B}_R(\cdot \| \cdot)$, starting point $\phi \in \Theta$, and fixed learning rate $\eta > 0$, the algorithms are as follows:
FTRL plays $\theta_{t+1} = \arg\min_{\theta \in \Theta} \mathcal{B}_R(\theta \| \phi) + \eta \sum_{s \le t} \ell_s(\theta)$.
OMD plays $\theta_{t+1} = \arg\min_{\theta \in \Theta} \mathcal{B}_R(\theta \| \phi) + \eta \sum_{s \le t} \langle \nabla_s, \theta \rangle$.
This formulation makes the connection between the two algorithms – that they are equivalent in the linear case – very explicit. There exists a more standard formulation of OMD that is used to highlight its generalization of OGD – the case of $R(\cdot) = \frac{1}{2}\|\cdot\|_2^2$ – and the fact that the update is carried out in the dual space induced by the regularizer (Hazan, 2015, Section 5.3). However, we will only need the following regret bound for FTRL, which, since it holds for all $G$-Lipschitz convex functions, also holds for OMD run on the linearized losses (Shalev-Shwartz, 2011, Theorem 2.11): for $R$ 1-strongly-convex w.r.t. $\|\cdot\|$,
$$\sum_{t=1}^T \ell_t(\theta_t) - \ell_t(\theta^*) \le \frac{\mathcal{B}_R(\theta^* \| \phi)}{\eta} + \eta G^2 T \quad \forall\, \theta^* \in \Theta.$$
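The lazy form of the update can be made concrete in the squared-Euclidean case (a sketch of our own, with names and the toy gradients chosen for illustration): with $R = \frac{1}{2}\|\cdot\|_2^2$ we have $\mathcal{B}_R(\theta \| \phi) = \frac{1}{2}\|\theta - \phi\|_2^2$, and the linearized objective has the closed-form minimizer $\phi$ minus the scaled running gradient sum, i.e. lazy gradient descent started from the initialization $\phi$.

```python
import numpy as np

def lazy_omd(phi, grads, eta):
    """Linearized lazy OMD with the squared-l2 regularizer: each iterate
    minimizes 0.5*||theta - phi||^2 + eta*<sum of past gradients, theta>,
    whose unconstrained minimizer is phi - eta * (running gradient sum)."""
    iterates, g_sum = [], np.zeros_like(phi)
    for g in grads:
        iterates.append(phi - eta * g_sum)  # play before seeing the new loss
        g_sum = g_sum + g
    return iterates
```

Note the first play is $\phi$ itself, which is why the quality of the learned initialization directly controls within-task regret.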
We next review the online algorithms we use for the meta-update. The main requirement here is logarithmic regret guarantees for the case of strongly convex loss functions. Two well-known algorithms do so with the following guarantee on a sequence of functions indexed by $t \in [T]$ that are $\alpha_t$-strongly-convex and $G_t$-Lipschitz w.r.t. $\|\cdot\|_2$:
$$\sum_{t=1}^T \ell_t(\theta_t) - \min_{\theta \in \Theta} \sum_{t=1}^T \ell_t(\theta) \le \frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{\alpha_{1:t}}.$$
The algorithms and sources for this regret guarantee are the following:
Follow-the-Leader (FTL), which plays $\theta_{t+1} = \arg\min_{\theta \in \Theta} \sum_{s \le t} \ell_s(\theta)$ (Kakade & Shalev-Shwartz, 2008, Theorem 2).
Adaptive Online Gradient Descent (OGD), which plays $\theta_{t+1} = \theta_t - \frac{1}{\alpha_{1:t}} \nabla_t$ (Bartlett et al., 2008, Theorem 2.1).
Of course, these are not the only algorithms achieving logarithmic regret on strongly convex functions. For example, the popular AdaGrad algorithm also does so (Duchi et al., 2010, Theorem 13). However, the proof of the main result requires that the meta algorithm only play points in the convex hull of the points seen thus far. This is because we must stay in the smaller meta-learned subset that we assume contains all the optimal parameters. Since we do not know this subset, we cannot use the projections most online methods use to remain feasible. We can easily show in the following claim that FTL and OGD satisfy these requirements but leave the extension to different meta-update algorithms, either of this or of the main proof, to future work.
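The convex-hull property for FTL can be seen directly in a toy case (our own sketch, not the paper's proof): on quadratic losses $\ell_t(\theta) = \frac{w_t}{2}\|\theta - p_t\|_2^2$ the cumulative-loss minimizer is the weighted average of the points seen so far, which always lies in their convex hull.

```python
import numpy as np

def ftl_plays(points, weights):
    """FTL on losses l_t(theta) = (w_t/2)*||theta - p_t||^2: the minimizer of
    the cumulative loss is the weighted average of the points seen so far."""
    plays = []
    for t in range(1, len(points) + 1):
        plays.append(np.average(points[:t], axis=0, weights=weights[:t]))
    return plays
```

Since a weighted average never leaves the hull of its points, no explicit projection onto the unknown meta-learned subset is needed.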
Let $\mathcal{B}_R$ be a Bregman regularizer on $\Theta$ and consider any $\{\theta_t\}_{t \in [T]} \subset \Theta^*$ for some convex subset $\Theta^* \subset \Theta$. Then for the loss sequence $\ell_t(\cdot) = \sigma_t \mathcal{B}_R(\theta_t \| \cdot)$ for any positive scalars $\sigma_t$, if we assume $\phi_1 \in \Theta^*$ then FTL will play $\phi_{t+1} = \frac{1}{\sigma_{1:t}} \sum_{s=1}^t \sigma_s \theta_s \in \Theta^*$, and OGD will as well if we further assume $R(\cdot) = \frac{1}{2}\|\cdot\|_2^2$.
The proof for FTL follows directly from Claim A.1 and the fact that the weighted average of a set of points is in their convex hull. For OGD we proceed by induction on $t$. The base case holds by the assumption $\phi_1 \in \Theta^*$. In the inductive case, note that $\nabla_t = \sigma_t(\phi_t - \theta_t)$, so the gradient update is $\phi_{t+1} = \phi_t - \frac{\sigma_t}{\sigma_{1:t}}(\phi_t - \theta_t)$, which is on the line segment between $\phi_t$ and $\theta_t$, so the proof is complete by the convexity of $\Theta^*$. ∎
a.3 Online-to-Batch Conversion
Finally, since we are also interested in distributional meta-learning, we discuss standard techniques for converting regret guarantees into generalization bounds, usually called online-to-batch conversions. In particular, for OCO we have the following bound on the risk of the average of the actions taken by an online algorithm, a result of applying Jensen's inequality to Proposition 2 in Cesa-Bianchi & Gentile (2005):
Let $\theta_1, \dots, \theta_T \in \Theta$ be the actions of an online algorithm and let $\ell_1, \dots, \ell_T : \Theta \to [0, 1]$ be convex loss functions drawn i.i.d. from some distribution $\mathcal{D}$. Then w.p. $1 - \delta$ we have
$$\mathbb{E}_{\ell \sim \mathcal{D}}\, \ell(\bar\theta) \le \hat L_T + \sqrt{\frac{2}{T} \log \frac{1}{\delta}},$$
where $\bar\theta = \frac{1}{T} \sum_{t=1}^T \theta_t$ and $\hat L_T = \frac{1}{T} \sum_{t=1}^T \ell_t(\theta_t)$ is the average loss suffered by the agent.
Thus for distributions over bounded convex loss functions we can run a low-regret online algorithm and perform asymptotically as well as ERM in hindsight w.h.p. However, for our lifelong algorithms we also consider online algorithms whose losses are the regrets of within-task algorithms, viewed as functions of the initialization and learning rates. One can obtain a good action in expectation by picking one at random (Cesa-Bianchi et al., 2004, Proposition 1):
Let be the actions of an online algorithm and let be loss functions drawn i.i.d. from some distribution . Then we have
Note that Cesa-Bianchi et al. (2004) only prove the first inequality; the second follows via the same argument by applying the symmetric version of the Azuma-Hoeffding inequality Azuma (1967). There is also a deterministic way to pick an action from $\theta_1, \dots, \theta_T$ using a penalized ERM approach with a high-probability bound (Cesa-Bianchi et al., 2004, Theorem 4); however, that algorithm, while computable in polynomial time, is not very efficient in practice in our setting.
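The averaging conversion is easy to demonstrate on a toy realizable regression problem (our own sketch, with made-up data): run online gradient descent on i.i.d. convex losses and output the average iterate, whose distance to the risk minimizer shrinks as the algorithm's average online loss does.

```python
import numpy as np

# Online-to-batch conversion sketch: OGD on i.i.d. squared losses
# l(w) = 0.5*(x @ w - y)^2, outputting the averaged iterate.
rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0])           # risk minimizer of the distribution
w, iterates, eta = np.zeros(2), [], 0.05
for _ in range(500):
    x = rng.normal(size=2)
    y = x @ w_star                       # draw a fresh i.i.d. loss
    w = w - eta * (x @ w - y) * x        # OGD step on the new sample
    iterates.append(w.copy())
w_avg = np.mean(iterates, axis=0)        # the online-to-batch output
```

The averaged iterate lands close to the risk minimizer even though no single loss is ever revisited, mirroring the high-probability statement above.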
Appendix B Proofs of Main Theoretical Results
We start with some technical lemmas. The first lower-bounds the regret of FTL when the loss functions are quadratic.
For any and positive scalars define and let be any point in . Then
We proceed by induction on . The base case follows directly since and so the second term is zero. In the inductive case we have
so it suffices to show