Meta-learning [5, 6, 7] was designed to mimic this human ability. A meta-learning algorithm is first given a set of meta-training tasks, assumed to be drawn from some distribution, and attempts to extract prior knowledge applicable to all tasks in the form of a meta-learner. This meta-learner is then evaluated on an unseen task, usually assumed to be drawn from a distribution similar to the training one. Although meta-learning has developed rapidly in recent years, it typically assumes that all meta-training tasks are available together as a batch, which does not capture the sequential nature of continual lifelong learning, where new tasks are revealed one after another.
Meanwhile, online learning [8] specifically tackles the sequential setting. At each round $t$, the algorithm picks a point $\mathbf{w}_t$, and suffers a loss $f_t(\mathbf{w}_t)$ revealed by a potentially adversarial environment. The goal is to minimize the regret, the difference between the cumulative loss suffered by the algorithm and that of any fixed predictor, formally:
$$R_T := \sum_{t=1}^{T} f_t(\mathbf{w}_t) - \min_{\mathbf{w}} \sum_{t=1}^{T} f_t(\mathbf{w}).$$
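To make the protocol concrete, here is a minimal sketch of online gradient descent and its regret against the best fixed predictor in hindsight; the 1-D quadratic losses, the step-size, and the grid search are illustrative assumptions, not part of the paper's setup.

```python
import random

def online_gradient_descent(losses, grads, eta=0.1, w0=0.0):
    """At each round t, play w_t, suffer loss_t(w_t), then take one
    gradient step on the just-revealed loss."""
    w, total_loss, plays = w0, 0.0, []
    for loss, grad in zip(losses, grads):
        plays.append(w)
        total_loss += loss(w)
        w -= eta * grad(w)
    return plays, total_loss

# A stream of quadratic losses (w - z_t)^2 with slowly varying targets.
random.seed(0)
targets = [random.uniform(0.4, 0.6) for _ in range(200)]
losses = [lambda w, z=z: (w - z) ** 2 for z in targets]
grads = [lambda w, z=z: 2 * (w - z) for z in targets]

plays, alg_loss = online_gradient_descent(losses, grads)
# Best fixed predictor in hindsight, found by a coarse grid search.
best_fixed = min(sum(l(u) for l in losses) for u in [i / 100 for i in range(101)])
regret = alg_loss - best_fixed  # stays small relative to T for convex losses
```

The regret here stays small because the losses are convex; Section 3 discusses why this notion is too strong to hope for in the non-convex case.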
Yet, online learning treats the whole process as a single task, without adaptation at each individual round.
Neither paradigm alone is ideal for the continual lifelong learning scenario. Thus, Finn et al. [9] proposed to combine them into the Online Meta-Learning framework, which will be discussed in Section 2. However, this framework relies on a strong convexity assumption, while many problems of current interest are non-convex in nature. Thus, in Section 3, we generalize this framework to the non-convex setting. Section 4 presents an exemplification of our algorithm with rigorous theoretical proofs of its performance guarantee. Real-data experiment results are shown in Section 5. Finally, concluding remarks and takeaways are provided in Section 6. To the best of our knowledge, this is the first theoretical regret analysis for non-convex online meta-learning algorithms, shedding light on the application of online meta-learning to more challenging learning problems in the paradigm of deep neural networks.
We use bold letters to denote vectors, e.g., $\mathbf{x}$. The $i$-th coordinate of a vector $\mathbf{x}$ is $x_i$. Unless explicitly noted, we study the Euclidean space with the inner product $\langle \cdot, \cdot \rangle$ and the Euclidean norm $\|\cdot\|$. We assume everywhere our objective function is bounded from below and denote the infimum by $f^{\star}$. The gradient of a function $f$ at $\mathbf{x}$ is $\nabla f(\mathbf{x})$.
Algorithm 1 is the online meta-learning framework proposed in [9]. A meta-learner $\mathbf{w}_t$ is maintained to preserve the prior knowledge learned from past rounds. Upon seeing a new task, one is first given some training data $D_t^{tr}$ for adapting $\mathbf{w}_t$ to the current task following some strategy $\mathcal{A}$. Then the test data $D_t^{test}$ will be revealed for evaluating the performance of the adapted learner $\mathcal{A}(\mathbf{w}_t, D_t^{tr})$. The loss suffered at this round can then be fed into an online learning algorithm to update $\mathbf{w}_t$. Following [9], we use the one-step gradient adaptation $\mathcal{A}(\mathbf{w}, D) = \mathbf{w} - \alpha \nabla \mathcal{L}(\mathbf{w}, D)$, where $\alpha$ is a step-size and $\mathcal{L}(\cdot, D)$ is the empirical loss on $D$.
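The round structure of this framework can be sketched as follows. This is a hypothetical first-order illustration: the quadratic toy tasks, the one-step inner adaptation, and the FOMAML-style meta-update (which ignores the Jacobian of the adaptation step) are all simplifying assumptions rather than the paper's exact method.

```python
import numpy as np

def adapt(w, grad_train, alpha=0.1):
    # One-step gradient adaptation on the task's training data.
    return w - alpha * grad_train(w)

def online_meta_learning(tasks, eta=0.05, alpha=0.1, dim=2):
    """tasks: iterable of (grad_train, loss_test, grad_test) triples.
    Maintain a meta-learner w; each round, adapt to the new task's
    training data, evaluate on its test data, then update w by online
    gradient descent on the test loss of the adapted parameters."""
    w = np.zeros(dim)
    test_losses = []
    for grad_tr, loss_te, grad_te in tasks:
        phi = adapt(w, grad_tr, alpha)      # task-specific learner
        test_losses.append(loss_te(phi))    # performance on test data
        w = w - eta * grad_te(phi)          # first-order meta-update
    return w, test_losses

# Toy task stream: each task wants parameters near a shared center.
rng = np.random.default_rng(0)
centers = [np.array([1.0, 1.0]) + 0.05 * rng.standard_normal(2) for _ in range(100)]
tasks = [(lambda w, c=c: 2 * (w - c),
          lambda p, c=c: float(np.sum((p - c) ** 2)),
          lambda p, c=c: 2 * (p - c)) for c in centers]
w, test_losses = online_meta_learning(tasks)
```

Because the toy tasks share structure, the per-round test loss shrinks as the meta-learner accumulates prior knowledge.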
Their performance is measured by the regret
$$\sum_{t=1}^{T} \hat f_t(\mathbf{w}_t) - \min_{\mathbf{w}} \sum_{t=1}^{T} \hat f_t(\mathbf{w}), \qquad (1)$$
where $\hat f_t(\mathbf{w})$ denotes the round-$t$ loss of the learner adapted from $\mathbf{w}$; this competes with any fixed meta-learner. Under this, they designed the Follow the Meta Leader algorithm, which enjoys a logarithmic regret when assuming strong convexity of $\hat f_t$.
3 Problem Formulation
In this section, we generalize the online meta-learning algorithm to the non-convex setting, by first demonstrating the infeasibility of a regret of form (1) and then introducing an alternative performance measure.
Finding the global minimum of a non-convex function is in general NP-hard. Yet, if we could find an online learning algorithm with sublinear regret of form (1) for some non-convex function class, we could optimize any function of that class efficiently: simply run the online learning algorithm with that objective as the loss at every round, and output an iterate chosen uniformly at random. This gives us:
$$\mathbb{E}\big[f(\mathbf{x}_{\mathrm{out}})\big] - \min_{\mathbf{x}} f(\mathbf{x}) \le \frac{R_T}{T} = o(1),$$
which leads to a contradiction unless P $=$ NP. Thus, we have to find another performance measure for the non-convex case. One potential candidate is the local regret proposed by Hazan et al. [10]:
$$R_w(T) := \sum_{t=1}^{T} \big\| \nabla F_{t,w}(\mathbf{x}_t) \big\|^2,$$
where $F_{t,w}(\mathbf{x}) := \frac{1}{w} \sum_{i=0}^{w-1} f_{t-i}(\mathbf{x})$, $1 \le w \le T$ is the window size, and $f_t \equiv 0$ for $t \le 0$. The reason for using a sliding window in $R_w(T)$, especially a large one, can be justified by Theorem 2.7 in [10].
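A small sketch of the sliding-window quantity (the names are illustrative; rounds before the start contribute zero gradient, matching the convention that early losses are identically zero):

```python
import numpy as np

def local_regret(grad_fns, plays, w):
    """Sum over rounds of the squared norm of the window-averaged
    gradient: gradients of the last w losses, evaluated at the point
    played at round t; out-of-range rounds contribute zero."""
    total = 0.0
    for t in range(len(plays)):
        g = sum((grad_fns[i](plays[t]) for i in range(max(0, t - w + 1), t + 1)),
                start=np.zeros_like(plays[0]))
        total += float(np.sum((g / w) ** 2))
    return total
```

With window $w = 1$ this reduces to the sum of squared gradient norms at the played points.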
4 Algorithm & Theoretical Guarantees
4.1 Stochasticity of Online Meta-learning Algorithms
Since the test data at each round is typically just a random batch sampled from the whole test set, the losses and gradients obtained at each round are (unbiased) estimates of the true ones. This is the stochastic setting, which we formalize by making the following assumptions.
We assume that at each round $t$, each call to any stochastic gradient oracle yields an i.i.d. random vector with the following properties: (a) unbiasedness, (b) bounded variance, and (c) mutual independence across distinct calls. Throughout, $\mathbb{E}_t[\cdot]$ denotes the conditional expectation given all randomness prior to round $t$.
Hazan et al. proposed a time-smoothed online gradient descent algorithm [10] for such a case. Yet, that algorithm's performance critically relies on the choice of the step-size $\eta$, and it may even diverge if $\eta$ exceeds a threshold depending on the (often unknown) smoothness constant of the loss function. We thus propose to use the AdaGrad-Norm [11] algorithm (Algorithm 2) as the online learning algorithm in Algorithm 1 instead. Here, $b_0 > 0$ is the initialization of the accumulated squared gradient norms and prevents division by 0, while $\eta$ ensures homogeneity and that the units match.
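A minimal sketch of the AdaGrad-Norm update used as the online learner, assuming the scalar-accumulator form of Ward et al.; the toy objective and constants are illustrative.

```python
import numpy as np

def adagrad_norm(grad, x0, T=200, eta=1.0, b0=1e-2):
    """One scalar step-size eta / b_t, where b_t^2 accumulates the
    squared gradient norms; no smoothness constant is needed to set
    the step-size, which is the point of using AdaGrad-Norm here."""
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2                      # b_0^2: prevents division by 0
    for _ in range(T):
        g = grad(x)
        b_sq += float(np.sum(g ** 2))   # b_t^2 = b_{t-1}^2 + ||g_t||^2
        x = x - (eta / np.sqrt(b_sq)) * g
    return x

x = adagrad_norm(lambda x: 2 * x, [3.0, -2.0])  # minimize ||x||^2
```

Even with $\eta = 1$ and no tuning, the accumulated norms shrink the step-size automatically, so the iteration does not diverge.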
4.2 Convergence Analysis
We present below an analysis of this algorithm assuming the loss function satisfies:
Each loss function is twice differentiable and: (a) bounded, (b) Lipschitz, (c) smooth, and (d) Hessian-Lipschitz.
Note that this implies [12, Lemma 1.2.3]:
$$\big| f(\mathbf{y}) - f(\mathbf{x}) - \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \big| \le \frac{L}{2} \|\mathbf{y} - \mathbf{x}\|^2 \quad \text{for all } \mathbf{x}, \mathbf{y},$$
where $L$ is the smoothness constant.
Under Assumption 2, we can derive the following properties of the adapted loss (the proof can be found in the Appendix):
Assuming Assumption 2, the adapted loss is $\hat{M}$-bounded, $\hat{G}$-Lipschitz, and $\hat{L}$-smooth.
The following theorem shows that, by a suitable selection of the window size $w$, a logarithmic local regret of the algorithm is guaranteed for any choice of the remaining hyper-parameters.
Before showing the proof of Theorem 1, we need the following technical lemmas, whose proofs can be found in the Appendix. For simplicity, $\mathbb{E}_t[\cdot]$ denotes conditioning on the history before round $t$ and taking expectation w.r.t. the randomness of round $t$:
As , and , Assumption 1 gives us:
Given Assumption 2(d), we have: .
Lemma 4 ([13], Lemma 9).
Let $h$ be a nonincreasing function, and let $a_t = a_0 + \sum_{i=1}^{t} b_i$ with $a_0 > 0$ and $b_i \ge 0$. Then
$$\sum_{t=1}^{T} b_t\, h(a_t) \le \int_{a_0}^{a_T} h(x)\, \mathrm{d}x.$$
Proof of Theorem 1.
The proof follows that of Theorem 2.1 in .
Let $\mathbb{E}_t$ denote taking expectation w.r.t. the randomness of round $t$ conditioned on the history:
Second, from the definition of and we have:
Using this, and Jensen's inequality on , which is a convex function, we can upper-bound Equation (5) by its absolute value, which in turn can be upper-bounded by:
Third, by using inequality with , , Equation (7) can be upper-bounded by:
where we used that holds for .
Applying the same inequality again, but with , , Equation (8) can be upper-bounded by:
Fourth, substituting the two inequalities above back, and in turn putting the result into Equation (5), gives us:
Rearranging terms, then taking expectation on both sides and summing from $t = 1$ to $T$:
As , letting be in Lemma 4 gives us:
where we used Jensen’s inequality for which is a concave function in .
Since each $f_t$ is Lipschitz, so is $F_{t,w}$; thus, using the Cauchy–Schwarz inequality:
Finally, using Markov’s inequality, with probability :
Denote . Using a derivation similar to that of Equation (12), with probability we have:
This means, with probability , we have:
Denoting Equation (14) as , and using Markov's inequality again, we have, with probability :
Therefore, with probability :
By solving the above "quadratic" inequality of and letting , the proof is complete.
We evaluated our algorithm on the few-shot image classification task of the Omniglot [14] dataset, which consists of 20 instances of each of 1,623 characters from 50 different alphabets. The dataset is augmented with rotations by multiples of 90 degrees following [15].
We employed the $N$-way $K$-shot protocol [7]: at each round, pick $N$ unseen characters, irrespective of alphabets. Provide the meta-learner with $K$ different drawings of each of the $N$ characters as the training set, then evaluate the adapted model's ability on new unseen instances within the $N$ classes (namely, the test set). We chose the 5-way 5-shot scheme, and used 15 samples per character for testing following [16].
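One round of the sampling protocol can be sketched as follows; the helper function and the dict-based character pool are hypothetical, not the paper's code.

```python
import random

def sample_task(char_pool, n_way=5, k_shot=5, n_query=15, rng=random):
    """Pick n_way unseen character classes (irrespective of alphabet),
    k_shot drawings per class for the training set, and n_query further
    drawings per class for the test set.
    char_pool: dict mapping character id -> list of drawing ids."""
    classes = rng.sample(sorted(char_pool), n_way)
    d_tr, d_test = [], []
    for label, char in enumerate(classes):
        drawings = rng.sample(char_pool[char], k_shot + n_query)
        d_tr += [(d, label) for d in drawings[:k_shot]]
        d_test += [(d, label) for d in drawings[k_shot:]]
    return d_tr, d_test
```

For the 5-way 5-shot scheme with 15 test samples per character, this yields 25 training and 75 test examples per round, with no drawing shared between the two sets.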
The model we used is a CNN following [7]. It contains 4 modules, each of which is a 3$\times$3 convolution with 64 filters, followed by batch normalization [17], a ReLU non-linearity, and 2$\times$2 max-pooling. Images are downsampled to 28$\times$28 so that the resulting feature map of the last hidden layer is 1$\times$1$\times$64. The last layer is fed into a fully connected layer, and the loss we used is the Cross-Entropy loss.
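The final spatial size can be checked with a quick computation, assuming "same" padding for the 3$\times$3 convolutions and stride-2 pooling:

```python
def conv_module_out(size, kernel=3, padding=1, pool=2):
    """Spatial size after one module: a 3x3 convolution with stride 1
    and padding 1 preserves the size; 2x2 max-pooling halves it
    (rounding down)."""
    after_conv = size + 2 * padding - kernel + 1  # = size here
    return after_conv // pool

size = 28                # input images are 28x28
for _ in range(4):       # four convolutional modules
    size = conv_module_out(size)
# 28 -> 14 -> 7 -> 3 -> 1: the last feature map is 1x1 spatially
```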
To study if our algorithm provides any empirical benefit over traditional methods, we compare it to two benchmark algorithms [9]: Train on Everything (TOE) and Train from Scratch (TFS). On each round, both initialize a new model. The difference is that TOE trains over all available data, both training and testing, from all past tasks, plus the training data of the current round, while TFS uses only the current round's training data.
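The two baselines differ only in how each round's training set is assembled, which can be sketched as follows (hypothetical helpers; data items are opaque records):

```python
def toe_train_set(history, d_tr_now):
    """Train on Everything: pool all training and test data from all
    past rounds, plus the current round's training data."""
    pooled = []
    for d_tr, d_test in history:
        pooled += list(d_tr) + list(d_test)
    return pooled + list(d_tr_now)

def tfs_train_set(history, d_tr_now):
    """Train from Scratch: only the current round's training data."""
    return list(d_tr_now)
```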
The experiments are performed in PyTorch [18], and parameters are left at their default values unless otherwise specified. For the step-size $\alpha$ of the local adaptation strategy in Algorithm 1, we set it to 0.1 everywhere, and the gradient descent step is performed only once for each task. For the AdaGrad-Norm algorithm (Algorithm 2) we used, we set $\eta$ and $b_0$ as suggested in the original paper [11]. The TFS and TOE used Adam [19] with default parameters.
The result is shown in Figure 1, which suggests that our algorithm gradually accumulates prior knowledge, enabling fast learning of later tasks. TFS provides a good example of how the CNN performs when training data is scarce. On the contrary, TOE behaves nearly as random guessing. The inferiority of TOE to TFS is somewhat surprising, as TOE has much more training data than TFS. The reason is that TOE regards all training data as coming from a single distribution and tries to learn one model that works for all tasks. Thus, when tasks are substantially different from each other, TOE may even incur negative transfer and fail to solve any single task, as has been observed in [20]. Meanwhile, by using the training data of the current task only, TFS avoids negative transfer, but also rules out learning any connection between tasks. Our algorithm, in contrast, is designed to discover common structure across tasks and use this information to guide fast adaptation to new tasks.
The continual lifelong learning problem is common in real life, where an agent needs to accumulate knowledge from every task it encounters and utilize that knowledge for fast learning of new tasks. To solve this problem, we can combine the meta-learning and online-learning paradigms to form the online meta-learning framework. In this work, we generalized this framework to the non-convex setting, and introduced the local regret to replace the original regret definition. We applied it to the stochastic setting, and showed its superiority both in theory and practice. In future work, we would like to evaluate our algorithm on harder learning problems over larger-scale datasets such as ImageNet.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
-  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
-  Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in Proceedings of International Conference on Machine Learning, 2016, pp. 173–182.
-  Devang K Naik and RJ Mammone, “Meta-neural networks that learn by learning,” in Proceedings of International Joint Conference on Neural Networks. IEEE, 1992, vol. 1, pp. 437–442.
-  Sebastian Thrun and Lorien Pratt, Learning to learn, Springer Science & Business Media, 2012.
-  Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., “Matching networks for one shot learning,” in Proceedings of Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
-  Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, learning, and games, Cambridge university press, 2006.
-  Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine, “Online meta-learning,” in Proceedings of International Conference on Machine Learning, 2019, pp. 1920–1930.
-  Elad E Hazan, Karan Singh, and Cyril Zhang, “Efficient regret minimization in non-convex games,” in Proceedings of International Conference on Machine Learning, 2017, pp. 2278–2288.
-  Rachel Ward, Xiaoxia Wu, and Leon Bottou, “Adagrad stepsizes: sharp convergence over nonconvex landscapes,” in Proceedings of International Conference on Machine Learning, 2019, pp. 6677–6686.
-  Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer, 2003.
-  Xiaoyu Li and Francesco Orabona, “On the convergence of stochastic gradient descent with adaptive stepsizes,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 983–992.
-  Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
-  Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap, “Meta-learning with memory-augmented neural networks,” in Proceedings of International Conference on Machine Learning, 2016, pp. 1842–1850.
-  Sachin Ravi and Hugo Larochelle, “Optimization as a model for few-shot learning,” International Conference on Learning Representations, 2017.
-  Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of International Conference on Machine Learning, 2015, pp. 448–456.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” 2017.
-  Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2014.
-  Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov, “Actor-mimic: Deep multitask and transfer reinforcement learning,” International Conference on Learning Representations, 2016.
Appendix A Appendix
A.1 Proof of Lemma 1
Lemma 1. Assuming Assumption 2, the adapted loss is $\hat{M}$-bounded, $\hat{G}$-Lipschitz, and $\hat{L}$-smooth.
We first write out the complete formula of the adapted loss:
The boundedness is straightforward.
To show the Lipschitzness, we derive the gradient:
Note that and both share the properties of , thus, from Assumption 2(a,b), we have:
Next, denoting as , we have :
where the first inequality uses the triangle inequality of a norm; the second inequality uses the smoothness and Hessian-Lipschitzness assumptions; the third inequality uses the smoothness assumption.
We are left to prove the last inequality:
where the first inequality uses the triangle inequality of a norm, and the second inequality uses the smoothness assumption. ∎
A.2 Proof of Lemma 2
Lemma 2. As , and , Assumption 1 gives us:
Note that $\mathbb{E}_t[\cdot]$ denotes conditioning on the history and taking expectation w.r.t. the randomness of round $t$.
In Assumption 1(a) we assume unbiasedness for each oracle call; the linearity of expectation then immediately gives us part (a).
To see the second part, we only need to expand as:
Each item of the first part in the last equation can be bounded according to Assumption 1(b), which leads to an overall upper bound.
For the second part, we need to use the Mutual Independence assumption (namely Assumption 1(c)):
Using Assumption 1(a) again, we know that the above equation equals 0. This proves part (b) of this lemma. ∎
A.3 Proof of Lemma 3
Lemma 3. Given Assumption 2(d), we have: .