1 Introduction
The goal of meta-learning can be broadly defined as using the data of existing tasks to learn algorithms or representations that enable better or faster performance on unseen tasks. As the modern iteration of learning-to-learn (LTL) Thrun & Pratt (1998), research on meta-learning has largely focused on developing new tools that can exploit the power of the latest neural architectures. Examples include the control of stochastic gradient descent (SGD) itself using a recurrent neural network Ravi & Larochelle (2017) and learning deep embeddings that allow simple classification methods to work well Snell et al. (2017). A particularly simple but successful approach has been parameter-transfer via gradient-based meta-learning, which learns a meta-initialization $\phi$ for a class of parametrized functions $f_\theta$ such that one or a few stochastic gradient steps on a few samples from a new task suffice to learn good task-specific model parameters $\hat\theta$. For example, when presented with examples $(x_i, y_i)$, $i = 1, \dots, m$, for an unseen task, the popular MAML algorithm Finn et al. (2017) outputs
$$\hat\theta = \phi - \eta \sum_{i=1}^m \nabla_\phi L(f_\phi(x_i), y_i) \qquad (1)$$
for loss function $L$ and learning rate $\eta > 0$; $\hat\theta$ is then used for inference on the task. Despite its simplicity, gradient-based meta-learning is a leading approach for LTL in numerous domains including vision Li et al. (2017); Nichol et al. (2018); Kim et al. (2018), robotics Al-Shedivat et al. (2018), and federated learning Chen et al. (2018).
While meta-initialization is a more recent approach, methods for parameter-transfer have long been studied in the multi-task, transfer, and lifelong learning communities Evgeniou & Pontil (2004); Kuzborskij & Orabona (2013); Pentina & Lampert (2014). A common classical alternative to (1), which in modern parlance may be called meta-regularization, is to learn a good bias $\phi$ for the following regularized empirical risk minimization (ERM) problem:
$$\hat\theta = \arg\min_\theta \sum_{i=1}^m L(f_\theta(x_i), y_i) + \frac{1}{2\eta}\|\theta - \phi\|_2^2 \qquad (2)$$
Although there exist statistical guarantees and poly-time algorithms for learning a meta-regularization for simple models Pentina & Lampert (2014); Denevi et al. (2018b), such methods are impractical and do not scale to modern settings with deep neural architectures and many tasks. On the other hand, while the theoretically less-studied meta-initialization approach is often compared to meta-regularization Finn et al. (2017), their connection is not rigorously understood.
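The relationship between the two updates can be illustrated on a toy problem. The following is a minimal sketch, assuming a one-dimensional linear model $f_\theta(x) = \theta x$ with squared loss, a single gradient step from the meta-initialization for (1), and a squared-distance penalty with weight $1/(2\eta)$ for (2); the function names and data are hypothetical and not taken from the paper.

```python
def maml_step(phi, xs, ys, eta):
    """One gradient step from the meta-initialization phi (form assumed for Eq. 1).

    Loss per example: 0.5 * (theta * x - y)^2, so the gradient is (theta * x - y) * x.
    """
    grad = sum((phi * x - y) * x for x, y in zip(xs, ys))
    return phi - eta * grad


def regularized_erm(phi, xs, ys, eta):
    """Closed-form minimizer of sum_i 0.5*(theta*x_i - y_i)^2 + (theta - phi)^2 / (2*eta)
    (form assumed for Eq. 2)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (sxy + phi / eta) / (sxx + 1.0 / eta)


xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
phi = 1.5
for eta in (1e-4, 1e-2, 1.0):
    # For small eta the two outputs agree to first order in eta.
    print(eta, maml_step(phi, xs, ys, eta), regularized_erm(phi, xs, ys, eta))
```

For small $\eta$ the two outputs coincide up to second-order terms in $\eta$, while for large $\eta$ the regularized solution approaches the unregularized least-squares fit; this is only a toy illustration of why the two approaches are often compared.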
In this work, we formalize this connection using the theory of online convex optimization (OCO) Zinkevich (2003), in which an intimate connection between initialization and regularization is well-understood due to the equivalence of online gradient descent (OGD) and follow-the-regularized-leader (FTRL) Shalev-Shwartz (2011); Hazan (2015). In the lifelong setting of an agent solving a sequence of OCO tasks, we use this connection to analyze an algorithm that learns a parameter $\phi$, which can be a meta-initialization for OGD or a meta-regularization for FTRL, such that the within-task regret of these algorithms improves with the similarity of the online tasks; here the similarity is measured by the distance between the optimal actions of each task and is not known beforehand. This algorithm, which we call Follow-the-Meta-Regularized-Leader (FMRL or Ephemeral), scales well in both computation and memory requirements, and in fact generalizes the gradient-based meta-learning algorithm Reptile Nichol et al. (2018), thus providing a convex-case theoretical justification for a leading method in practice.
More specifically, we make the following contributions:

Our first result assumes a sequence of OCO tasks whose optimal actions $\theta_t^*$ are inside a small subset $\Theta^*$ of the set $\Theta$ of all possible actions. We show how Ephemeral can use these $\theta_t^*$ to make the average regret decrease in the diameter of $\Theta^*$ and do no worse on dissimilar tasks. Furthermore, we extend a lower bound of Abernethy et al. (2008) to the multi-task setting to show that Ephemeral is provably better than single-task learning and that one can do no more than a small constant-factor better sans stronger assumptions.

Under a realistic assumption on the loss functions, we show that Ephemeral also has low-regret guarantees in the practical setting where the optimal actions $\theta_t^*$ are difficult or impossible to compute and the algorithm only has access to a statistical or numerical approximation. In particular, we show high-probability regret bounds in the case when the approximation uses the gradients observed during within-task training, as is done in practice by Reptile Nichol et al. (2018).
We verify several assumptions and implications of our theory using a new meta-learning dataset we introduce consisting of text-classification tasks solvable using convex methods. We further study the empirical suggestions of our theory in the deep learning setting.
1.1 Related Work
Gradient-Based Meta-Learning: The model-agnostic meta-learning (MAML) algorithm of Finn et al. (2017) pioneered this recent approach to LTL. A great deal of empirical work has studied and extended this approach Li et al. (2017); Grant et al. (2018); Nichol et al. (2018); Jerfel et al. (2018); in particular, Nichol et al. (2018) develop Reptile, a simple yet equally effective first-order simplification of MAML for which our analysis shows provable guarantees as a subcase. Theoretically, Franceschi et al. (2018) provide computational convergence guarantees for gradient-based meta-learning for strongly-convex functions, while Finn & Levine (2018) show that with infinite data MAML can approximate any function of task samples assuming a specific neural architecture as the model. In contrast to both results, we show finite-sample learning-theoretic guarantees for convex functions under a natural task-similarity assumption.
Online LTL: Learning-to-learn and multi-task learning (MTL) have both been extensively studied in the online setting, although our setting differs significantly from the one usually studied in online MTL Abernethy et al. (2007); Dekel et al. (2007); Cavallanti et al. (2010). There, in each round an agent is told which of a fixed set of tasks the current loss belongs to, whereas our analysis is in the lifelong setting, in which tasks arrive one at a time. Here there are many theoretical results for learning useful data representations Ruvolo & Eaton (2013); Pentina & Lampert (2014); Balcan et al. (2015); Alquier et al. (2017); the PAC-Bayesian result of Pentina & Lampert (2014) can also be used for regularization-based parameter transfer, which we also consider. Such methods are provable variants of practical shared-representation approaches, e.g. ProtoNets Snell et al. (2017), but unlike our algorithms they do not scale to deep neural networks. Our work is especially related to Alquier et al. (2017), who also consider a dynamic, many-task notion of regret. We achieve similar bounds with a significantly more practical meta-algorithm, although within-task their results hold for any low-regret method whereas ours only hold for OCO.
Statistical LTL: While we focus on the online setting, our online-to-batch conversion results also imply generalization bounds for distributional meta-learning. The standard assumption of a distribution over tasks is due to Baxter (2000); Maurer (2005) further extended the hypothesis-space-learning framework to algorithm-learning. Recently, Amit & Meir (2018) showed PAC-Bayesian generalization bounds for this setting, although without implying an efficient algorithm. Closely related to our work are the regularization-based approaches of Denevi et al. (2018a, b), which provide statistical learning guarantees for ridge regression with a meta-learned kernel or bias. Denevi et al. (2018b) is especially similar in spirit to our work in that it focuses on the usefulness of meta-learning compared to single-task learning, showing that their method is better than the regularized ERM baseline. In contrast to our work, neither work provides algorithms that scale to more complex models or addresses the connection between loss-regularization and gradient-descent-initialization.
2 Meta-Initialization & Meta-Regularization
In this paper we study simple methods of the form shown in Algorithm 1, in which we run a within-task online algorithm on each new task and then update the initialization or regularization of this algorithm using a meta-update online algorithm. Alquier et al. (2017) study a method of this form in which the meta-update is conducted using exponentially-weighted averaging. Our use of OCO for the meta-update makes this class of algorithms much more practical; for example, in the case of OGD for both the inner and outer loop we recover the Reptile algorithm of Nichol et al. (2018).
In order to analyze this type of algorithm, we first discuss the OCO methods that make up both its inner and outer loop and the inherent connection they provide between initialization and regularization. We then make this connection explicit by formalizing the notion of learning a meta-initialization or meta-regularization as learning a parameterized Bregman regularizer. We conclude this section by proving convex-case upper and lower bounds on the task-averaged regret.
2.1 Online Convex Optimization
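In the online learning setting, at each time $t = 1, \dots, T$ an agent chooses action $\theta_t \in \Theta$ and suffers loss $\ell_t(\theta_t)$ for some adversarially chosen function $\ell_t : \Theta \to \mathbb{R}$ that subsumes the loss, model, and data in $L(f_\theta(x_t), y_t)$ into one function of $\theta$. The goal is to minimize regret, the difference between the total loss and that of the optimal fixed action:
$$\mathbf{R}_T = \sum_{t=1}^T \ell_t(\theta_t) - \min_{\theta \in \Theta} \sum_{t=1}^T \ell_t(\theta)$$
When $\mathbf{R}_T = o(T)$, then as $T \to \infty$ the average loss of the agent will approach that of an optimal fixed action.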
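For OCO, $\ell_t$ is assumed convex and Lipschitz for all $t$. This setting provides many practically useful algorithms such as online gradient descent (OGD). Parameterized by a starting point $\theta_1 = \phi \in \Theta$ and learning rate $\eta > 0$, OGD plays
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \left\|\theta - \left(\theta_t - \eta \nabla \ell_t(\theta_t)\right)\right\|_2 \qquad (3)$$
and achieves sublinear regret $O(D\sqrt{T})$ when $\eta \propto D/\sqrt{T}$, where $D$ is the diameter of the action space $\Theta$.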
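As an illustration of the OGD update, here is a minimal sketch of projected online gradient descent, assuming the action space is a Euclidean ball centered at the origin and that each loss exposes a gradient function. The helper names (`project_ball`, `ogd`) and the toy quadratic losses are hypothetical choices for this example.

```python
import math


def project_ball(theta, radius):
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = math.sqrt(sum(v * v for v in theta))
    if norm <= radius:
        return list(theta)
    return [v * radius / norm for v in theta]


def ogd(phi, grad_fns, eta, radius):
    """Return the actions theta_1, ..., theta_T played by projected OGD started at phi."""
    theta = project_ball(phi, radius)
    actions = []
    for grad in grad_fns:
        actions.append(theta)  # play theta_t, then observe the gradient of loss t
        step = [v - eta * g for v, g in zip(theta, grad(theta))]
        theta = project_ball(step, radius)
    return actions


# Toy losses l_t(theta) = 0.5 * ||theta - z_t||^2, whose gradient is theta - z_t.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
grad_fns = [lambda th, z=z: [a - b for a, b in zip(th, z)] for z in targets]
actions = ogd(phi=[0.0, 0.0], grad_fns=grad_fns, eta=0.5, radius=1.0)
print(actions)
```

The starting point `phi` is the only state carried into the task, which is what makes it a natural target for meta-learning.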
Note the similarity between OGD and the meta-initialization update in Equation 1. In fact another fundamental OCO algorithm, follow-the-regularized-leader (FTRL), is a direct analog for the meta-regularization algorithm in Equation 2, with its action at each time being the output of regularized ERM over the previous data:
$$\theta_t = \arg\min_{\theta \in \Theta} \frac{1}{2\eta}\|\theta - \phi\|_2^2 + \sum_{s<t} \ell_s(\theta) \qquad (4)$$
Note that most definitions set $\phi = 0$. A crucial connection here is that on linear functions $\ell_t(\cdot) = \langle \nabla_t, \cdot \rangle$, OGD initialized at $\theta_1 = \phi$ plays the same actions as FTRL. Since linear losses are the hardest losses, in that low regret for them implies low regret for convex functions Zinkevich (2003), in the online setting this equivalence suggests that meta-initialization is a reasonable surrogate for meta-regularization because it is solving the hardest version of the problem. The OGD-FTRL equivalence can be extended to other geometries by replacing the squared-norm in (4) by a strongly-convex function $R$:
$$\theta_t = \arg\min_{\theta \in \Theta} \frac{1}{\eta} R(\theta) + \sum_{s<t} \ell_s(\theta)$$
In the case of linear losses this is the online mirror descent (OMD) generalization of OGD. For $G$-Lipschitz losses, OMD and FTRL have the following well-known regret guarantee (Shalev-Shwartz, 2011, Theorem 2.11), where $\theta^*$ is the optimal fixed action in hindsight:
$$\mathbf{R}_T \le \frac{1}{\eta} R(\theta^*) + \eta G^2 T \qquad (5)$$
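The OGD-FTRL equivalence on linear losses can be checked numerically. The sketch below assumes the unconstrained case (no projection) and a squared Euclidean regularizer $\frac{1}{2\eta}\|\theta - \phi\|_2^2$, for which the FTRL minimizer has the closed form $\phi - \eta \sum_{s<t} \nabla_s$; the function names and gradient values are hypothetical.

```python
def ogd_actions(phi, grads, eta):
    """Unconstrained OGD on linear losses <g_t, theta>: theta_{t+1} = theta_t - eta * g_t."""
    theta, out = list(phi), []
    for g in grads:
        out.append(list(theta))
        theta = [v - eta * gv for v, gv in zip(theta, g)]
    return out


def ftrl_actions(phi, grads, eta):
    """Unconstrained FTRL with regularizer ||theta - phi||^2 / (2 * eta).

    Setting the gradient of the objective to zero gives
    theta_t = phi - eta * (sum of the gradients seen before time t).
    """
    out, gsum = [], [0.0] * len(phi)
    for g in grads:
        out.append([p - eta * s for p, s in zip(phi, gsum)])
        gsum = [s + gv for s, gv in zip(gsum, g)]
    return out


phi = [0.3, -0.7]
grads = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
ogd_play = ogd_actions(phi, grads, eta=0.1)
ftrl_play = ftrl_actions(phi, grads, eta=0.1)
max_gap = max(abs(a - b) for u, v in zip(ogd_play, ftrl_play) for a, b in zip(u, v))
print(max_gap)  # zero up to floating-point rounding
```

With a constrained action space the comparison involves the projection step as well, which this sketch deliberately leaves out.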
2.2 Task-Averaged Regret and Task Similarity
In this paper we consider the lifelong extension of the online learning setting, where now $t = 1, \dots, T$ index a sequence of online learning problems, in each of which the agent must sequentially choose $m_t$ actions $\theta_{t,i} \in \Theta$ and suffer loss $\ell_{t,i}(\theta_{t,i})$. Since in meta-learning we are interested in doing well on individual tasks, we will aim to minimize a dynamic notion of regret in which the comparator changes with each task:
Definition 2.1.
The task-averaged regret (TAR) of an online algorithm after $T$ tasks with $m_t$ steps is
$$\bar{\mathbf{R}} = \frac{1}{T} \sum_{t=1}^T \left( \sum_{i=1}^{m_t} \ell_{t,i}(\theta_{t,i}) - \min_{\theta \in \Theta} \sum_{i=1}^{m_t} \ell_{t,i}(\theta) \right)$$
As the comparator in this regret is dynamic, without very strong assumptions one cannot hope to achieve TAR decreasing in $T$. A seeming remedy for this issue in our parameter-transfer setting is to subtract from TAR a “meta-comparator” that uses the optimal meta-initialization or meta-regularization in hindsight but with the same within-task algorithm. However, to prove regret sublinear in $T$ using this approach, one has to use low-regret algorithms with tight constants on their upper and lower bounds, as otherwise the agent will always suffer an worse loss on each task. Such tight bounds are known for very few algorithms Abernethy et al. (2008). Our study of TAR is thus motivated by an interest in understanding average-case regret, as well as our derivation of an online-to-batch conversion for generalization bounds on distributional LTL. Note also that TAR is similar to the compound regret studied by Alquier et al. (2017), although they also compete with the best representation in hindsight.
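Task-averaged regret, i.e. the per-task regret against each task's own best fixed action, averaged over tasks, can be computed directly once the hindsight optimum of each task is available. The following sketch assumes scalar actions and that the caller supplies the correct per-task optimum; the data layout and names are hypothetical.

```python
def task_averaged_regret(tasks):
    """Average over tasks of (loss incurred) - (loss of the task's best fixed action).

    tasks: list of (losses, actions, optimum) triples, where `losses` is a list of
    callables, `actions` are the actions played, and `optimum` is assumed to be the
    minimizer of the summed losses of that task.
    """
    total = 0.0
    for losses, actions, optimum in tasks:
        incurred = sum(loss(a) for loss, a in zip(losses, actions))
        best = sum(loss(optimum) for loss in losses)
        total += incurred - best
    return total / len(tasks)


def quad(z):
    """Toy loss theta -> 0.5 * (theta - z)^2."""
    return lambda theta: 0.5 * (theta - z) ** 2


task_a = ([quad(1.0), quad(3.0)], [0.0, 0.0], 2.0)  # regret 5.0 - 1.0 = 4.0
task_b = ([quad(0.0), quad(0.0)], [0.0, 0.0], 0.0)  # regret 0.0
print(task_averaged_regret([task_a, task_b]))  # 2.0
```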
We now formalize our similarity assumption on the tasks $t = 1, \dots, T$: their optimal actions $\theta_t^*$ lie within a small subset $\Theta^*$ of the action space $\Theta$. This is natural for studying gradient-based meta-learning, as the notion that there exists a meta-parameter $\phi$ from which a good parameter for any individual task is reachable with only a few steps implies that they are all close together. We develop algorithms whose TAR scales with the diameter $D^*$ of $\Theta^*$; notably, this means they will not do much worse if $\Theta^* = \Theta$, i.e. if the tasks are not related in this way, but will do well if $D^* \ll D$. Importantly, our methods will not require knowledge of $\Theta^*$.
Assumption 2.1.
Assume each task $t$ consists of $m_t$ convex Lipschitz loss functions $\ell_{t,i} : \Theta \to \mathbb{R}$ and let $\theta_t^*$ be the minimum-norm optimal action in hindsight for task $t$. Define $\Theta^* \subset \Theta$ to be the minimal subset containing all $\theta_t^*$.
Note $\theta_t^*$ is unique as the minimum of the squared norm, a strongly convex function, over minima of a convex function. The algorithms in Section 2.4 assume an efficient oracle computing $\theta_t^*$.
2.3 Parameterizing Bregman Regularizers
Following the main idea of gradient-based meta-learning, our goal is to learn a $\phi \in \Theta$ such that an online algorithm such as OGD starting from $\phi$ will have low regret. We thus treat regret as our objective and observe that in the regret of FTRL (5), the regularizer $R$ effectively encodes a distance from the initialization to $\theta^*$. This is clear in the Euclidean geometry for $R(\theta) = \frac{1}{2}\|\theta - \phi\|_2^2$, but can be extended via the Bregman divergence Bregman (1967), defined for everywhere-sub-differentiable and convex $R$ as
$$B_R(x \| y) = R(x) - R(y) - \langle \nabla R(y), x - y \rangle$$
The Bregman divergence has many useful properties Banerjee et al. (2005) that allow us to use it almost directly as a parameterized regularization function. However, in order to use OCO for the meta-update we also require it to be strictly convex in the second argument, a property that holds for the Bregman divergence of both the $\ell_2$ regularizer and the entropic regularizer used for online learning over the probability simplex, e.g. with expert advice.
Definition 2.2.
Let $R : S \to \mathbb{R}$ be 1-strongly-convex w.r.t. norm $\|\cdot\|$ on convex $S$. Then we call the Bregman divergence $B_R(x \| y)$ a Bregman regularizer if $B_R(x \| \cdot)$ is strictly convex for any fixed $x \in S$.
Within each task, the regularizer is parameterized by the second argument and acts on the first. More specifically, for $R = \frac{1}{2}\|\cdot\|_2^2$ we have $B_R(\theta \| \phi) = \frac{1}{2}\|\theta - \phi\|_2^2$, and so in the case of FTRL and OGD, $\phi$ is a parameterization of the regularization and the initialization, respectively. In the case of the entropic regularizer, the associated Bregman regularizer is the KL-divergence from $\theta$ to $\phi$ and thus meta-learning can very explicitly be seen as learning a prior.
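Both special cases can be verified numerically from the standard definition of the Bregman divergence, $R(x) - R(y) - \langle \nabla R(y), x - y \rangle$. The sketch below is a small illustration with hypothetical helper names; the entropic case assumes both arguments lie in the interior of the probability simplex.

```python
import math


def bregman(R, grad_R, x, y):
    """Bregman divergence R(x) - R(y) - <grad R(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_R(y), x, y))
    return R(x) - R(y) - inner


def half_sq_norm(v):
    return 0.5 * sum(c * c for c in v)


def neg_entropy(p):
    return sum(c * math.log(c) for c in p)


# Squared Euclidean norm: the divergence is 0.5 * ||x - y||^2.
x, y = [1.0, 2.0], [0.0, -1.0]
sq = bregman(half_sq_norm, lambda v: list(v), x, y)
print(sq, 0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

# Negative entropy on the simplex: the divergence is the KL-divergence.
p, q = [0.2, 0.8], [0.5, 0.5]
kl = bregman(neg_entropy, lambda v: [math.log(c) + 1.0 for c in v], p, q)
print(kl, sum(a * math.log(a / b) for a, b in zip(p, q)))
```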
Finally, we use Bregman regularizers to formally define our parameterized learning algorithms:
Definition 2.3.
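$\mathrm{FTRL}_{\eta,\phi}$, for $\eta > 0$, $\phi \in \Theta$, where $\Theta$ is some bounded convex subset $\Theta \subset S$, plays
$$\theta_t = \arg\min_{\theta \in \Theta} B_R(\theta \| \phi) + \eta \sum_{s<t} \ell_s(\theta)$$
for Bregman regularizer $B_R$. Similarly, $\mathrm{OMD}_{\eta,\phi}$ plays
$$\theta_t = \arg\min_{\theta \in \Theta} B_R(\theta \| \phi) + \eta \sum_{s<t} \langle \nabla_s, \theta \rangle$$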
2.4 Follow-the-Meta-Regularized-Leader
We now specify the first variant of our main algorithm, Follow-the-Meta-Regularized-Leader (Ephemeral). First consider the case where the diameter $D^*$ of $\Theta^*$, as measured by the square root of the maximum Bregman divergence between any two points, is known. Starting with $\phi_1 \in \Theta$, run $\mathrm{FTRL}_{\eta,\phi_t}$ or $\mathrm{OMD}_{\eta,\phi_t}$ with on the losses in each task $t$. After each task, compute $\phi_{t+1}$ using an OCO meta-update algorithm operating on the Bregman divergences $B_R(\theta_t^* \| \cdot)$. For $D^*$ unknown, make an underestimate $\varepsilon$ and multiply it by a factor $\gamma$ each time .
The following is a regret bound for this algorithm when the meta-update is either Follow-the-Leader (FTL), which plays the minimizer of all past losses, or OGD with adaptive step size. We call this Ephemeral variant Follow-the-Average-Leader (FAL) because in the case of FTL the algorithm uses the mean of the previous optimal parameters in hindsight as the initialization. Pseudocode for this and other variants is given in Algorithm 2. For brevity, we state results for $m_t = m$ for all $t$; detailed statements are in the supplement.
Theorem 2.1.
Proof Sketch.
We give a proof for and known task similarity, i.e. . A full proof is in the supplement. Use to denote the divergence to and let . Note is strongly-convex and is the minimizer of their sum, with average distance . We can then expand Definition 2.1 for task-averaged regret:
The first two lines just substitute the regret bound (5) of FTRL and OMD. The key step is the last one, where the regret is split into two terms: on the left, the loss of the meta-update algorithm, and on the right, the loss incurred if we had always initialized at the mean of the optimal actions . Since is a sequence of strongly-convex functions with minimizer , and since each is determined by playing FTL or OGD on these same functions, the left-hand term is exactly the regret of these algorithms on strongly-convex functions, which is known to be Bartlett et al. (2008); Kakade & Shalev-Shwartz (2008). Substituting and the definition of sets the right-hand term to
∎
The full proof uses the doubling trick on the unknown task similarity , which requires an analysis of the location of the meta-parameter to ensure that we only increase the guess when needed. The extension to non-Euclidean geometries uses a novel logarithmic regret bound for FTL over Bregman regularizer losses.
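The Follow-the-Average-Leader idea can be sketched in a few lines for the Euclidean case. This is a simplified illustration rather than the algorithm as stated in the paper: it assumes unconstrained actions, toy losses $\frac{1}{2}\|\theta - z\|_2^2$ (so each task's optimum in hindsight is the mean of its targets), a fixed learning rate instead of one tuned to the task similarity, and no doubling trick. All names are hypothetical.

```python
def within_task_ogd(phi, targets, eta):
    """Run unconstrained OGD from phi on losses 0.5 * ||theta - z||^2; return total loss."""
    theta, total = list(phi), 0.0
    for z in targets:
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(theta, z))
        theta = [a - eta * (a - b) for a, b in zip(theta, z)]
    return total


def task_optimum(targets):
    """Best fixed action in hindsight for the toy quadratic losses: the mean target."""
    m, d = len(targets), len(targets[0])
    return [sum(z[k] for z in targets) / m for k in range(d)]


def follow_the_average_leader(tasks, phi0, eta):
    """After each task, set the next initialization to the mean of all optima so far.

    This is the FTL meta-update on the losses 0.5 * ||theta_t^* - phi||^2, whose
    cumulative minimizer is the running mean of the theta_t^*.
    """
    phi, opt_sum, losses = list(phi0), [0.0] * len(phi0), []
    for t, targets in enumerate(tasks, start=1):
        losses.append(within_task_ogd(phi, targets, eta))
        opt_sum = [s + o for s, o in zip(opt_sum, task_optimum(targets))]
        phi = [s / t for s in opt_sum]
    return phi, losses


tasks = [[[0.0, 0.0], [2.0, 2.0]], [[3.0, 3.0], [3.0, 3.0]]]
phi, losses = follow_the_average_leader(tasks, phi0=[0.0, 0.0], eta=0.5)
print(phi, losses)
```

Only the running sum of task optima is stored, so memory does not grow with the number of tasks.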
Remark 2.1.
Note that if we know the standard deviation of the task parameters from their mean , setting the learning rate in Algorithm 2 and following the same analysis as above will give task-averaged regret , which is at least as good as the bound above since and is less sensitive to possible outlier tasks.
Theorem 2.1 shows that, so long as the similarity guess $\varepsilon$ is not too large, the task-averaged regret of Ephemeral will scale with the task similarity $D^*$. The component shows that this bound improves if the $\theta_t^*$ are close on average; in the case we have . Furthermore, if , i.e. if the tasks are not similar, then the algorithm will only do a constant factor worse than FTRL or OMD; this is similar to other “optimistic” methods that work well under regularity Rakhlin & Sridharan (2013); Jadbabaie et al. (2015). These results show that gradient-based meta-learning is useful in convex settings: under a simple notion of task similarity, using multiple tasks leads to better performance than the regret of running the same algorithm in a single-task setting. Furthermore, the algorithm scales well in terms of computation and memory requirements, and in the setting is very similar to Reptile Nichol et al. (2018).
However, it is easy to see that an even simpler “strawman” algorithm achieves regret only a constant factor worse than Ephemeral: at time $t+1$, simply initialize FTRL or OMD using the optimal parameter $\theta_t^*$ of task $t$. Of course, since such algorithms are often used in the few-shot setting of small $m$, a reduction in the average regret is practically significant; we observe this empirically in Figure 3. Indeed, in the proof of Theorem 2.1 the regret converges to the regret bound obtained by always playing the mean of the optimal actions if we somehow knew it beforehand, which will not occur when playing the strawman algorithm. Furthermore, the following lower bound on the task-averaged regret, a multi-task extension of Abernethy et al. (2008, Theorem 4.2), shows that such constant-factor reductions are the best we can achieve under our task similarity assumption:
Corollary 2.1.
Assume and that for each an adversary must play a sequence of convex Lipschitz functions whose optimal actions in hindsight are contained in some fixed ball with center and diameter . Then the adversary can force the agent to have TAR at least .
More broadly, this lower bound shows that the learning-theoretic benefits of gradient-based meta-learning are inherently limited without stronger assumptions on the tasks. Nevertheless, Ephemeral-style algorithms are very attractive from a practical perspective, as their memory and computation requirements per iteration scale linearly in the dimension and not at all in the number of tasks.
3 Provable Guarantees for Practical Gradient-Based Meta-Learning
In the previous section we showed that an algorithm with access to the best actions in hindsight of each task could learn a good meta-initialization or meta-regularization. In practice we may wish to be more computationally efficient and use a simpler-to-compute quantity for the meta-update. In addition, in the i.i.d. case few-shot ERM may not be a good task representation and a task similarity assumption on the true risk minimizers may be more relevant. In this section we first show how two simple variants of Ephemeral handle these settings. Finally, we also provide an online-to-batch conversion result for task-averaged regret that implies good generalization guarantees when any of the variants of Ephemeral are run in the distributional LTL setting.
3.1 Simple-to-Compute Meta-Updates
The FAL variant of Ephemeral uses each task’s minimum-norm optimal action in hindsight to perform a meta-update. While $\theta_t^*$ is efficiently computable in some cases, in most cases it is more efficient and practical to use an estimate instead. This is especially true when applying these methods in the deep learning setting; for example, Nichol et al. (2018) find that taking the average within-task gradient works well. Furthermore, in the batch setting, when each task consists of i.i.d. samples drawn from an adversarially chosen distribution, a more natural notion of task similarity would depend on the true risk minimizer of each task, of which $\theta_t^*$ is just an estimate. We thus extend the results of Section 2.4 to handle these considerations by proving regret bounds for two variants of Ephemeral: one for the adversarial setting which uses the final action on task $t$ as the meta-update, and one for the stochastic setting which uses the average iterate. We call these methods FLI-Online and FLI-Batch, respectively, where FLI stands for Follow-the-Last-Iterate.
However, to achieve these guarantees we need to make some assumptions on the within-task loss functions. This is unavoidable because we need estimates of the optimal actions of different tasks to be nearby; in general, for some $\theta$ a convex function $f$ can have small $f(\theta) - f(\theta^*)$ but large $\|\theta - \theta^*\|$ if $f$ does not increase quickly away from the minimum. This makes it impossible to use guarantees on the loss of an estimate of $\theta_t^*$ to bound its distance from $\theta_t^*$. We therefore make assumptions that some aggregate loss, e.g. the expectation or sum of the within-task losses, satisfies the following growth condition:
Definition 3.1.
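A function $f : \Theta \to \mathbb{R}$ has $\alpha$-quadratic-growth ($\alpha$-QG) w.r.t. $\|\cdot\|$ for $\alpha > 0$ if for any $\theta \in \Theta$ and its closest minimum $\theta^*$ of $f$ we have
$$\frac{\alpha}{2}\|\theta - \theta^*\|^2 \le f(\theta) - f(\theta^*)$$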
QG has recently been used to provide fast rates for both offline and online GD that hold for practical problems such as LASSO and logistic regression under data-dependent assumptions Karimi et al. (2016); Garber (2019). It can be shown to hold for for strongly-convex and some ; in this case Karimi et al. (2016). Note that QG will also be satisfied when $f$ itself is strongly-convex, making the former a weaker condition.
We start with FLI-Online; as shown in Algorithm 2, this variant is the same as FAL except that the meta-update is performed using the last action of FTRL, i.e. the regularized empirical risk minimizer. To provide regret guarantees in this setting, we stipulate that the average loss is growing and strengthen the task similarity notion slightly:
Assumption 3.1.
Let each task consist of convex Lipschitz loss functions s.t. the total loss is QG w.r.t. . Define s.t. .
In contrast to Assumption 2.1, we require $\Theta^*$ to contain all optimal actions and not only the one with minimal norm. Furthermore, we require that the growth factor is . While this is a stronger requirement than usually assumed, in Figure 2 we show that it holds in certain real and synthetic settings. Note that this growth factor will always hold in the case of the losses themselves being strongly-convex.
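The claim that strong convexity implies the growth condition follows from a short standard argument. The derivation below is a sketch for differentiable $f$ on a convex domain $\Theta$, assuming the growth condition is written as $\frac{\alpha}{2}\|\theta - \theta^*\|^2 \le f(\theta) - f(\theta^*)$; the constant convention is an assumption of this sketch.

```latex
% alpha-strong convexity of f, applied at its minimizer \theta^* over convex \Theta:
f(\theta) \;\ge\; f(\theta^*) + \langle \nabla f(\theta^*),\, \theta - \theta^* \rangle
          + \frac{\alpha}{2}\,\|\theta - \theta^*\|^2 .
% First-order optimality of \theta^* on \Theta:
\langle \nabla f(\theta^*),\, \theta - \theta^* \rangle \;\ge\; 0
\quad \forall\, \theta \in \Theta .
% Combining the two inequalities yields quadratic growth:
f(\theta) - f(\theta^*) \;\ge\; \frac{\alpha}{2}\,\|\theta - \theta^*\|^2
\quad \forall\, \theta \in \Theta .
```

The converse does not hold in general, since quadratic growth only constrains $f$ relative to its minimizers, which is why it is the weaker condition.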
Under such data-dependent assumptions, and if the within-task algorithm is FTRL, we have the following bound:
Theorem 3.1.
Note that the above regret is very similar to that in Theorem 2.1 apart from a per-task error term decreasing in $m$ that is due to the use of an estimate of $\theta_t^*$.
We now turn to the FLI-Batch algorithm, which uses each task’s average action for the meta-update. To give a regret bound here, we assume that at each task $t$, an adversary picks a distribution over loss functions, from which $m$ samples are drawn i.i.d. This follows the batch-within-online setting of Alquier et al. (2017). We thus use the distance between the true-risk minimizers for the task similarity assumption:
Assumption 3.2.
Let each task consist of convex Lipschitz loss functions sampled i.i.d. from distribution s.t. the expected total loss is QG w.r.t. . Define s.t. .
We can show a high-probability bound on the task-averaged regret assuming strongly-smooth regularization:
Theorem 3.2.
3.2 Distributional Learning-to-Learn
While gradient-based LTL algorithms are largely online, their goals are often statistical. Here we review distributional LTL and prove an online-to-batch conversion showing that low TAR implies low risk for within-task learning.
As formulated by Baxter (2000), distributional LTL assumes a distribution over task distributions over . Given i.i.d. data samples from i.i.d. task samples , we seek to do well in expectation when new samples are drawn from a new distribution and we must learn how to predict given for . This models a setting with not enough data to learn on its own, i.e. $m$ is small, but where tasks are somehow related and thus we can use samples from to reduce the number of samples needed from . Parameter-transfer LTL lies within the algorithm-learning framework of Maurer (2005), where task samples are used to learn a learning algorithm parameterized by $\phi$ that takes data points and returns a prediction algorithm parameterized by $\hat\theta$.
Theorem 3.3 bounds the within-task expected risk under task-averaged regret guarantees for any task sample from the same distribution . For Ephemeral, the procedure picks a task $t$ uniformly at random, runs $\mathrm{FTRL}_{\eta,\phi_t}$ or $\mathrm{OMD}_{\eta,\phi_t}$ on samples from , and outputs the average iterate as the learned parameter. Note that guarantees on randomly chosen or mean iterates are standard, although in practice we use and the last iterate as the learned parameter.
Theorem 3.3.
Suppose convex losses are generated by sampling i.i.d. for some distribution over task distributions . Let be the state before some task picked uniformly at random of algorithm with task-averaged regret . Then if new loss functions are sampled from a new task distribution , running on these losses will generate s.t. w.p. their mean satisfies
where the outer expectation is over sampling and
The result follows by nesting two different single-task online-to-batch conversions Cesa-Bianchi et al. (2004); Cesa-Bianchi & Gentile (2005) via Jensen’s inequality. Note that the first term is the expected empirical risk of the ERM, which is small in many practical settings, such as for linear models over non-atomic distributions with . Thus, apart from the fast-decaying term, proving a low task-averaged regret multiplicatively improves the bound as on the risk of a new task sampled from .
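The within-task step of such a conversion, running an online algorithm over i.i.d. samples and returning the mean iterate, can be sketched as follows. This assumes scalar parameters, unconstrained OGD, and toy losses $\frac{1}{2}(\theta - z)^2$; the function name is hypothetical, and the meta-learned initialization is simply passed in as `phi`.

```python
def ogd_mean_iterate(phi, samples, eta):
    """Run OGD from phi over the samples and return the average of the iterates played."""
    theta, iterates = phi, []
    for z in samples:
        iterates.append(theta)
        theta = theta - eta * (theta - z)  # gradient of 0.5 * (theta - z)^2
    return sum(iterates) / len(iterates)


# Iterates played: 0.0, 0.5, 0.75, 0.875, so the mean iterate is 0.53125.
print(ogd_mean_iterate(phi=0.0, samples=[1.0, 1.0, 1.0, 1.0], eta=0.5))
```

A better initialization `phi` moves every iterate, and hence their mean, closer to the task's risk minimizer, which is the intuition behind transferring a low task-averaged regret to a low risk on a new task.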
4 Empirical Results
A major benefit of Ephemeral is its practicality. In particular, the FLI-Batch variant is scalable without modification to high-dimensional, non-convex models. Here its practical effectiveness is evidenced by the success of first-order MAML and similar algorithms, as our method is a generalization of Reptile Nichol et al. (2018), which performs slightly worse than MAML on the Omniglot benchmark Lake et al. (2017) but better on the harder MiniImageNet benchmark Ravi & Larochelle (2017). With this evidence, empirically our main goal is to validate our theory in the convex setting, although we also examine implications for deep meta-learning.
4.1 Convex Setting
We introduce a new dataset of 812 classification tasks, each consisting of sentences from one of four Wikipedia pages which we use as labels. We call this dataset MiniWikipedia. Our use of text classification to examine the convex setting is motivated by the well-known effectiveness of linear models over simple representations Wang & Manning (2012); Arora et al. (2018). We use logistic regression over continuous-bag-of-words (CBOW) vectors built using 50-dimensional GloVe embeddings Pennington et al. (2014). The similarity of these tasks is verified by seeing if their optimal parameters are close together. As shown before in Figure 1, we find when $\Theta$ is the unit ball that even in the 1-shot setting the tasks have non-vacuous similarity; in the 32-shot setting the parameters are contained in a set of radius 0.32.
We next compare Ephemeral to the “strawman” algorithm from Section 2, which simply uses the previous optimal action as the initialization. For both algorithms we use task similarity guess and tuning parameter . As expected, we see in Figure 3 that the improvement of Ephemeral over the strawman is especially prominent for few-shot learning, showing that the theoretical task-similarity-based improvement we achieve is practically significant when individual tasks have very few samples. We also see that FLI-Batch, which uses an estimate of the best parameter for the meta-update, approaches the performance of FAL as the number of samples increases and thus its estimate improves.
Finally, we evaluate the performance of Ephemeral and (first-order) MAML in the distributional setting on this NLP task. On each task we standardize data using the mean and deviation of the training features. For Ephemeral we use the FAL variant with OGD as the within-task algorithm, with learning rate set using the average deviation of the optimal task parameters from the mean optimal parameter, as suggested in Remark 2.1. For MAML, we use a hyperparameter sweep to determine the within-task and meta-update learning rates; for our algorithm, we simply use the root average squared distance of all tasks in hindsight, which from Theorem 2.1 can be seen to be minimizing the upper bound on the within-task regret. As shown in Figure 4, even though Ephemeral does not require any learning-rate tuning, unlike the MAML procedure, the algorithm performs comparably: slightly better for and slightly worse for .
4.2 Deep Learning
While our algorithm generalizes Reptile, an already-effective gradient-based meta-learning algorithm Nichol et al. (2018), we can still see if improvements suggested by our theory help for neural network LTL. To this end we study controlled modifications to the settings used in the Reptile experiments on 5-way and 20-way Omniglot Lake et al. (2017) and 5-way MiniImageNet classification Ravi & Larochelle (2017). For both datasets, we use the same convolutional neural networks as Nichol et al. (2018), which were themselves taken from the MAML experiments. As in this prior work, our evaluations are conducted in the transductive setting, in which test points are evaluated in batch, enabling sharing of batch normalization statistics.
Our theoretical results point to the importance of accurately computing the within-task parameter before the meta-update; Theorem 2.1 assumes access to the optimal parameter in hindsight, whereas Theorems 3.1 and 3.2 allow computational and stochastic approximations that result in an additional error term decaying with the number of within-task examples. This becomes relevant in the non-convex setting with many thousands of tasks, where it can be infeasible to find even a local optimum.
The theory thus suggests that using a better estimate of the within-task parameter for the meta-update may lead to lower regret, and thus lower generalization error. We can attain a better estimate by using more samples on each task, to reduce stochastic noise, or by running more gradient steps on each task, to reduce approximation error. It is not obvious that these changes will improve performance – it may be better to learn a few-shot learning algorithm using the same settings at meta-train and meta-test time. However, in practice the Reptile authors use more task samples – 10 for Omniglot and 15 for MiniImageNet – at meta-train time than the number of shots – at most 5 – used for evaluation. On the other hand, they use far fewer within-task gradient steps – 5 for Omniglot and 8 for MiniImageNet – at meta-train time than the 50 iterations used for evaluation.
We study how varying these two settings – the number of task samples and the number of within-task iterations – changes performance at meta-test time. In Figure 5, we see that more within-task samples provide a significant improvement in performance for MiniImageNet and Omniglot, with many fewer meta-iterations needed to reach good test performance. Reducing the number of meta-iterations is important in practice as it corresponds to fewer tasks needed for training, although for a better stochastic approximation each task needs more samples. On the other hand, increasing the number of training iterations does not need more samples, and we see in Figure 6 that raising this value can also lead to better performance, especially on 20-way Omniglot, although the effect is less clear for MiniImageNet, with the use of more than 8 training iterations reducing performance. The latter result is likely due to overfitting on specific tasks, with task similarity in this stochastic setting likely holding for the true rather than empirical risk minimizers, as in Assumption 3.2. The broad patterns shown above also hold for several other parameter settings, which we depict in greater detail in the supplement.
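To make the two settings concrete, the following is a minimal sketch of a Reptile-style loop on a toy problem, not the experimental code behind the figures. The one-dimensional squared loss, the Gaussian task model, and the names `within_task_gd`, `reptile`, `n_samples`, and `n_steps` are all assumptions made for illustration; the sketch only shows where the number of within-task samples and the number of within-task gradient steps enter the meta-update.

```python
import random


def within_task_gd(phi, samples, steps, lr):
    """Run `steps` full-batch gradient steps on mean((theta - x)^2), starting from phi."""
    theta = phi
    for _ in range(steps):
        grad = sum(2.0 * (theta - x) for x in samples) / len(samples)
        theta -= lr * grad
    return theta


def reptile(task_means, n_samples, n_steps, inner_lr=0.1, meta_lr=0.5, seed=0):
    """Reptile-style meta-training on toy tasks (hypothetical setup).

    Each task draws `n_samples` points around its mean; the within-task
    parameter is estimated with `n_steps` gradient steps, and the
    initialization phi is moved toward that estimate.
    """
    rng = random.Random(seed)
    phi = 0.0
    for mu in task_means:
        samples = [rng.gauss(mu, 0.1) for _ in range(n_samples)]
        theta = within_task_gd(phi, samples, n_steps, inner_lr)
        phi += meta_lr * (theta - phi)  # meta-update toward the task estimate
    return phi


task_rng = random.Random(1)
task_means = [task_rng.uniform(2.5, 3.5) for _ in range(200)]

# Fewer samples and steps versus more samples and steps per task.
phi_few = reptile(task_means, n_samples=5, n_steps=5)
phi_many = reptile(task_means, n_samples=15, n_steps=50)
```

In this toy setting, raising `n_samples` reduces the noise in each task estimate and raising `n_steps` moves the estimate closer to the within-task minimizer, which mirrors the two sources of error discussed above. It says nothing about overfitting effects of the kind observed on MiniImageNet.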
5 Conclusion
In this paper we undertook a study of a broad class of gradient-based meta-learning methods using the theory of online convex optimization. Our results show the usefulness of running such methods compared to single-task learning under the assumption that individual task parameters are close together. The fully online guarantees of our meta-algorithm, Ephemeral, can be extended to practically relevant approximate meta-updates, the batch-within-online setting, and distributional LTL.
Apart from these results, the simplicity of Ephemeral makes it extensible to various settings of practical interest in meta-learning, such as federated learning and differential privacy. In the theoretical direction, future work can consider more sophisticated notions of task-similarity, such as for multi-modal or continuously-evolving settings. While we have studied only the parameter-transfer setting, deriving statistical or low-regret guarantees for practical and scalable representation learning remains an important research goal.
Acknowledgments
This work was supported in part by DARPA FA875017C0141, National Science Foundation grants CCF1535967, IIS1618714, IIS1705121, and IIS1838017, a Microsoft Research Faculty Fellowship, an Okawa Grant, a Google Faculty Award, an Amazon Research Award, an Amazon Web Services Award, and a Carnegie Bosch Institute Research Award. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.
References

 Abernethy et al. (2007) Abernethy, J., Bartlett, P., and Rakhlin, A. Multitask learning with expert advice. In Proceedings of the International Conference on Computational Learning Theory, 2007.
 Abernethy et al. (2008) Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. Optimal strategies and minimax lower bounds for online convex games. 2008.
 Al-Shedivat et al. (2018) Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. In Proceedings of the 6th International Conference on Learning Representations, 2018.

 Alquier et al. (2017) Alquier, P., Mai, T. T., and Pontil, M. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
 Amit & Meir (2018) Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Arora et al. (2018) Arora, S., Khodak, M., Saunshi, N., and Vodrahalli, K. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTMs. In Proceedings of the 6th International Conference on Learning Representations, 2018.
 Azuma (1967) Azuma, K. Weighted sums of certain dependent random variables. Tôhoku Mathematical Journal, 19:357–367, 1967.
 Balcan et al. (2015) Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Proceedings of the Conference on Learning Theory, 2015.
 Banerjee et al. (2005) Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
 Bartlett et al. (2008) Bartlett, P. L., Hazan, E., and Rakhlin, A. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, 2008.
 Baxter (2000) Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
 Bregman (1967) Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.
 Cavallanti et al. (2010) Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934, 2010.
 Cesa-Bianchi & Gentile (2005) Cesa-Bianchi, N. and Gentile, C. Improved risk tail bounds for online algorithms. In Advances in Neural Information Processing Systems, 2005.
 Cesa-Bianchi et al. (2004) Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of online learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
 Chen et al. (2018) Chen, F., Dong, Z., Li, Z., and He, X. Federated meta-learning for recommendation. arXiv, 2018.
 Dekel et al. (2007) Dekel, O., Long, P. M., and Singer, Y. Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8:2233–2264, 2007.
 Denevi et al. (2018a) Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental learning-to-learn with statistical guarantees. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2018a.
 Denevi et al. (2018b) Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, 2018b.
 Duchi et al. (2010) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Proceedings of the Conference on Learning Theory, 2010.
 Evgeniou & Pontil (2004) Evgeniou, T. and Pontil, M. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
 Fellbaum (1998) Fellbaum, C. WordNet: An Electronic Lexical Database. MIT Press, 1998.
 Finn & Levine (2018) Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In Proceedings of the 6th International Conference on Learning Representations, 2018.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 Franceschi et al. (2018) Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Frank & Wolfe (1956) Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3, 1956.
 Garber (2019) Garber, D. Fast rates for online gradient descent without strong convexity via Hoffman’s bound. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
 Grant et al. (2018) Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. In Proceedings of the 6th International Conference on Learning Representations, 2018.
 Hazan (2015) Hazan, E. Introduction to online convex optimization. In Foundations and Trends in Optimization, volume 2, pp. 157–325. now Publishers Inc., 2015.
 Jadbabaie et al. (2015) Jadbabaie, A., Rakhlin, A., and Shahrampour, S. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
 Jerfel et al. (2018) Jerfel, G., Grant, E., Griffiths, T. L., and Heller, K. Online gradient-based mixtures for transfer modulation in meta-learning. arXiv, 2018.
 Kakade & Shalev-Shwartz (2008) Kakade, S. and Shalev-Shwartz, S. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, 2008.
 Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.
 Kim et al. (2018) Kim, J., Lee, S., Kim, S., Cha, M., Lee, J. K., Choi, Y., Choi, Y., Choi, D.-Y., and Kim, J. Auto-Meta: Automated gradient based meta learner search. arXiv, 2018.

 Kuzborskij & Orabona (2013) Kuzborskij, I. and Orabona, F. Stability and hypothesis transfer learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.
 Lake et al. (2017) Lake, B. M., Salakhutdinov, R., Gross, J., and Tenenbaum, J. B. One shot learning of simple visual concepts. In Proceedings of the Conference of the Cognitive Science Society (CogSci), 2017.
 Li et al. (2017) Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv, 2017.
 Maurer (2005) Maurer, A. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994, 2005.
 Nichol et al. (2018) Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv, 2018.

 Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014.
 Pentina & Lampert (2014) Pentina, A. and Lampert, C. H. A PAC-Bayesian bound for lifelong learning. In Proceedings of the 31st International Conference on Machine Learning, 2014.
 Polyak (1963) Polyak, B. T. Gradient methods for minimizing functionals. USSR Computational Mathematics and Mathematical Physics, 3(3):864–878, 1963.
 Rakhlin & Sridharan (2013) Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Proceedings of the Conference on Learning Theory, 2013.
 Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.
 Ruvolo & Eaton (2013) Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning, 2013.
 Shalev-Shwartz (2011) Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
 Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. S. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
 Thrun & Pratt (1998) Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 1998.
 Wang & Manning (2012) Wang, S. and Manning, C. D. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.
 Zinkevich (2003) Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.
Appendix A Background and Results for Online Convex Optimization
Throughout the appendix we assume all subsets are convex and in $\mathbb{R}^d$ unless explicitly stated. Let $\|\cdot\|_*$ be the dual norm of $\|\cdot\|$, which we assume to be any norm on $\mathbb{R}^d$, and note that the dual norm of $\|\cdot\|_2$ is itself. For sequences of scalars $\sigma_1, \dots, \sigma_T \in \mathbb{R}$ we will use the notation $\sigma_{1:t}$ to refer to the sum of the first $t$ of them. In the online learning setting, we will use the shorthand $\nabla_t$ to denote the subgradient of $\ell_t$ evaluated at action $\theta_t$. We will use $\operatorname{Conv}(S)$ to refer to the convex hull of a set of points $S$.
A.1 Convex Functions
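We first state the related definitions of strong convexity and strong smoothness:
Definition A.1.
An everywhere subdifferentiable function $f: S \to \mathbb{R}$ is $\alpha$-strongly-convex w.r.t. norm $\|\cdot\|$ if
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|^2 \quad \forall~x, y \in S.$$
Definition A.2.
An everywhere subdifferentiable function $f: S \to \mathbb{R}$ is $\beta$-strongly-smooth w.r.t. norm $\|\cdot\|$ if
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2} \|y - x\|^2 \quad \forall~x, y \in S.$$
We now turn to the Bregman divergence and a discussion of several useful properties Bregman (1967); Banerjee et al. (2005):
Definition A.3.
Let $f: S \to \mathbb{R}$ be an everywhere subdifferentiable strictly convex function. Its Bregman divergence is defined as
$$B_f(x \| y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle.$$
The definition directly implies that $B_f(\cdot \| y)$ preserves the (strong or strict) convexity of $f$ for any fixed $y \in S$. Strict convexity further implies $B_f(x \| y) \ge 0$ for all $x, y \in S$, with equality iff $x = y$. Finally, if $f$ is $\alpha$-strongly-convex, or $\beta$-strongly-smooth, w.r.t. $\|\cdot\|$ then Definition A.1, or Definition A.2, implies $B_f(x \| y) \ge \frac{\alpha}{2} \|x - y\|^2$, or $B_f(x \| y) \le \frac{\beta}{2} \|x - y\|^2$, respectively.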
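As a small numerical illustration of these properties, the sketch below evaluates the standard Bregman divergence $f(x) - f(y) - \langle \nabla f(y), x - y \rangle$ for two common generators. The helper names (`bregman`, `sq_norm`, `neg_entropy`) and the test vectors are assumptions chosen for the example; they do not come from the paper.

```python
import math


def bregman(f, grad_f, x, y):
    """Bregman divergence f(x) - f(y) - <grad f(y), x - y> for list-valued vectors."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_f(y), x, y))
    return f(x) - f(y) - inner


def sq_norm(x):
    """Half the squared Euclidean norm, which is 1-strongly-convex and 1-strongly-smooth."""
    return 0.5 * sum(v * v for v in x)


def sq_norm_grad(x):
    return list(x)


def neg_entropy(p):
    """Negative entropy on strictly positive vectors."""
    return sum(v * math.log(v) for v in p)


def neg_entropy_grad(p):
    return [math.log(v) + 1.0 for v in p]


x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

# For half the squared norm the divergence is half the squared distance.
b_euclid = bregman(sq_norm, sq_norm_grad, x, y)
half_sq_dist = 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))

# For negative entropy on probability vectors it is the KL divergence.
b_kl = bregman(neg_entropy, neg_entropy_grad, x, y)
kl = sum(a * math.log(a / b) for a, b in zip(x, y))
```

With the squared-norm generator the lower and upper bounds from strong convexity and strong smoothness hold with equality, since both constants equal one.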
Claim A.1.
Let be a strictly convex function on , be a sequence satisfying , and . Then
Proof.
A.2 Online Algorithms
Here we provide a review of the online algorithms we use within each task. Our focus is on two closely related meta-algorithms, Follow-the-Regularized-Leader (FTRL) and (linearized lazy) Online Mirror Descent (OMD). For a given Bregman regularizer, starting point, and fixed learning rate, the algorithms are as follows:

FTRL plays .

OMD plays .
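The relationship between the two algorithms can be sketched numerically. The code below assumes the commonly used forms, which may differ in scaling from the ones intended here: FTRL plays $\arg\min_\theta B_R(\theta \| \phi) + \eta \sum_{s<t} \ell_s(\theta)$, while linearized lazy OMD replaces each $\ell_s$ by its linearization $\langle \nabla_s, \theta \rangle$. It further assumes the scalar unconstrained case with $R(\theta) = \frac{1}{2} \theta^2$, so that both minimizers have closed forms. All function names are hypothetical.

```python
def ftrl_linear(phi, eta, slopes):
    """FTRL for linear losses l_s(theta) = g_s * theta: theta_t = phi - eta * sum_{s<t} g_s."""
    plays, slope_sum = [], 0.0
    for g in slopes:
        plays.append(phi - eta * slope_sum)
        slope_sum += g
    return plays


def ftrl_quadratic(phi, eta, targets):
    """FTRL for losses l_s(theta) = 0.5 * (theta - z_s)^2.

    Minimizing 0.5 * (theta - phi)^2 + eta * sum_{s<t} l_s(theta) over theta
    gives (phi + eta * sum_{s<t} z_s) / (1 + eta * (t - 1)).
    """
    plays, target_sum = [], 0.0
    for count, z in enumerate(targets):
        plays.append((phi + eta * target_sum) / (1.0 + eta * count))
        target_sum += z
    return plays


def omd_lazy(phi, eta, grad_fns):
    """Linearized lazy OMD: theta_t = phi - eta * sum_{s<t} grad_s(theta_s)."""
    plays, grad_sum = [], 0.0
    for grad in grad_fns:
        theta = phi - eta * grad_sum
        plays.append(theta)
        grad_sum += grad(theta)
    return plays


# Linear losses: the two algorithms produce the same actions.
slopes = [1.0, -2.0, 0.5, 3.0]
linear_ftrl = ftrl_linear(0.0, 0.1, slopes)
linear_omd = omd_lazy(0.0, 0.1, [lambda theta, g=g: g for g in slopes])

# Quadratic losses: the actions differ because OMD only sees gradients.
targets = [1.0, 2.0, 3.0]
quad_ftrl = ftrl_quadratic(0.0, 0.5, targets)
quad_omd = omd_lazy(0.0, 0.5, [lambda theta, z=z: theta - z for z in targets])
```

On the linear sequence the two lists agree, while on the quadratic sequence they separate from the second action onward.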
This formulation makes the connection between the two algorithms – that they are equivalent in the linear case – very explicit. There exists a more standard formulation of OMD that is used to highlight its generalization of OGD – the case of $R(\cdot) = \frac{1}{2} \|\cdot\|_2^2$ – and the fact that the update is carried out in the dual space induced by the regularizer (Hazan, 2015, Section 5.3). However, we will only need the following regret bound for FTRL, which since it holds for all Lipschitz convex functions also holds for OMD when (Shalev-Shwartz, 2011, Theorem 2.11):
(6) 
We next review the online algorithms we use for the meta-update. The main requirement here is logarithmic regret guarantees for the case of strongly convex loss functions. Two well-known algorithms do so with the following guarantee on a sequence of functions indexed by that are strongly-convex w.r.t. and Lipschitz w.r.t. :
(7) 
The algorithms and sources for this regret guarantee are the following:

Follow-the-Leader (FTL), which plays (Kakade & Shalev-Shwartz, 2008, Theorem 2).

Adaptive Online Gradient Descent (OGD), which plays (Bartlett et al., 2008, Theorem 2.1).
Of course, these are not the only algorithms achieving logarithmic regret on strongly convex functions. For example, the popular AdaGrad algorithm also does so (Duchi et al., 2010, Theorem 13). However, the proof of the main result requires that the meta-algorithm only play points in the convex hull of the points seen thus far. This is because we must stay in the smaller meta-learned subset that we assume contains all the optimal parameters. Since we do not know this subset, we cannot use the projections most online methods use to remain feasible. We can easily show in the following claim that FTL and OGD satisfy these requirements but leave the extension to different meta-update algorithms, either of this or of the main proof, to future work.
Claim A.2.
Let be a Bregman regularizer on and consider any for some convex subset . Then for loss sequence for any positive scalars , if we assume then FTL will play and OGD will as well if we further assume .
Proof.
The proof for FTL follows directly from Claim A.1 and the fact that the weighted average of a set of points is in their convex hull. For OGD we proceed by induction on . The base case holds by the assumption . In the inductive case, note that so the gradient update is , which is on the line segment between and , so the proof is complete by the convexity of . ∎
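The argument can be checked numerically in a simple case. The sketch below assumes scalar parameters, the Euclidean regularizer (so each loss is $\frac{\alpha_t}{2} (\theta_t - \phi)^2$), and an OGD step size of $1 / \alpha_{1:t}$; these choices and the names `ftl_plays` and `ogd_plays` are assumptions made for illustration. Under them each OGD update is a convex combination of the previous action and the newest point, and the two algorithms coincide after the first round.

```python
def ftl_plays(phi1, thetas, alphas):
    """FTL on losses alpha_t / 2 * (theta_t - phi)^2: after round t, play the weighted mean."""
    plays, weighted_sum, weight = [phi1], 0.0, 0.0
    for theta, alpha in zip(thetas, alphas):
        weighted_sum += alpha * theta
        weight += alpha
        plays.append(weighted_sum / weight)
    return plays


def ogd_plays(phi1, thetas, alphas):
    """OGD on the same losses with step size 1 / (alpha_1 + ... + alpha_t)."""
    plays, phi, weight = [phi1], phi1, 0.0
    for theta, alpha in zip(thetas, alphas):
        weight += alpha
        grad = alpha * (phi - theta)  # gradient of the round-t loss at phi
        phi = phi - grad / weight     # lands between phi and theta
        plays.append(phi)
    return plays


thetas = [1.0, 4.0, 2.0, 3.0]
alphas = [1.0, 2.0, 0.5, 1.5]
ftl = ftl_plays(2.5, thetas, alphas)
ogd = ogd_plays(2.5, thetas, alphas)
```

Every action after the first lies between the smallest and largest of the points seen so far, which is the one-dimensional version of staying in their convex hull.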
A.3 Online-to-Batch Conversion
Finally, since we are also interested in distributional meta-learning, we discuss standard techniques for converting regret guarantees into generalization bounds, which are usually named online-to-batch conversions. In particular, for OCO we have the following bound on the risk of the average over the actions taken by an online algorithm, a result of applying Jensen’s inequality to Proposition 2 in Cesa-Bianchi & Gentile (2005):
Proposition A.1.
Let be the actions of an online algorithm and let be convex loss functions drawn i.i.d. from some distribution . Then w.p. we have
where and is the average loss suffered by the agent.
Thus for distributions over bounded convex loss functions we can run a low-regret online algorithm and perform asymptotically as well as ERM in hindsight w.h.p. However, for our lifelong algorithms we are also considering online algorithms over the regret of within-task algorithms as a function of the initialization and learning rates. One can obtain a good action in expectation by picking one at random (Cesa-Bianchi et al., 2004, Proposition 1):
Proposition A.2.
Let be the actions of an online algorithm and let be loss functions drawn i.i.d. from some distribution . Then we have
Note that Cesa-Bianchi et al. (2004) only prove the first inequality; the second follows via the same argument but applying the symmetric version of the Azuma-Hoeffding inequality Azuma (1967). There is also a deterministic way to pick an action from using a penalized ERM approach with a high probability bound (Cesa-Bianchi et al., 2004, Theorem 4); however, the algorithm, while computable in polynomial time, is practically not very efficient in our setting.
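The two conversions discussed in this subsection, averaging the actions and picking one uniformly at random, can be illustrated on a toy stochastic problem. The sketch below assumes squared losses $(\theta - z)^2$ with $z$ drawn uniformly from a small finite support, so that the risk can be computed exactly; the support, the step-size schedule, and the function names are assumptions for the example only.

```python
import random


def online_to_batch(zs, lr=0.5):
    """Run OGD on losses (theta - z_t)^2 and return the actions and their average."""
    theta, actions = 0.0, []
    for t, z in enumerate(zs, start=1):
        actions.append(theta)
        theta -= (lr / t) * 2.0 * (theta - z)  # decaying step size lr / t
    return actions, sum(actions) / len(actions)


def risk(theta, support):
    """Exact expected loss when z is uniform over `support`."""
    return sum((theta - z) ** 2 for z in support) / len(support)


support = [-1.0, 0.0, 2.0, 3.0]
rng = random.Random(0)
zs = [rng.choice(support) for _ in range(500)]

actions, theta_bar = online_to_batch(zs)

# Expected risk of an action picked uniformly at random from the sequence.
avg_risk_of_actions = sum(risk(a, support) for a in actions) / len(actions)

# Risk of the averaged action; by Jensen's inequality it is no larger.
risk_of_average = risk(theta_bar, support)
```

The averaged action relies on convexity of the losses through Jensen's inequality, whereas the randomly chosen action does not, which is why the latter is the relevant conversion when the per-round objective is not known to be convex.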
Appendix B Proofs of Main Theoretical Results
B.1 Upper and Lower Bounds for Task-Averaged Regret (Theorem 2.1 and Corollary 2.1)
We start with some technical lemmas. The first lower-bounds the regret of FTL when the loss functions are quadratic.
Lemma B.1.
For any and positive scalars define and let be any point in . Then
Proof.
We proceed by induction on . The base case follows directly since and so the second term is zero. In the inductive case we have
so it suffices to show