1 Introduction
In recent years, high-capacity machine learning models, such as deep neural networks [1], have achieved remarkable successes in various domains [2, 3, 4]. However, domains where data is scarce remain a big challenge, as those models’ ability to learn and generalize relies heavily on the abundance of training data. In contrast, humans can learn new skills and concepts very efficiently from just a few experiences. This is because, when encountering a new task, learning algorithms start completely from scratch, while humans are typically armed with plenty of prior knowledge accumulated from past experience, which may share overlapping structures with the current task and thus enables efficient learning of the new task.
Meta-learning [5, 6, 7] was designed to mimic this human ability. A meta-learning algorithm is first given a set of meta-training tasks assumed to be drawn from some distribution, and attempts to extract prior knowledge applicable to all tasks in the form of a meta-learner. This meta-learner is then evaluated on an unseen task, usually assumed to be drawn from a distribution similar to the one used for training. Although meta-learning has developed rapidly in recent years, it typically assumes that all meta-training tasks are available together as a batch, which does not capture the sequential setting of continual lifelong learning, in which new tasks are revealed one after another.
Meanwhile, online learning [8] specifically tackles the sequential setting. At each round $t$, the algorithm picks an $\boldsymbol{x}_t$, and suffers a loss $f_t(\boldsymbol{x}_t)$ revealed by a potentially adversarial environment. The goal is to minimize the regret, the difference between the cumulative losses suffered by the algorithm and that of any fixed predictor $\boldsymbol{x}$, formally:
$$R_T(\boldsymbol{x}) = \sum_{t=1}^{T} f_t(\boldsymbol{x}_t) - \sum_{t=1}^{T} f_t(\boldsymbol{x}) \qquad (1)$$
Yet, online learning sees the whole process as a single task without adaptation for each single step.
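As an illustration of the regret described above, the following minimal sketch runs online gradient descent on one-dimensional squared losses and compares its cumulative loss with that of the best fixed predictor in hindsight. The loss family, the stepsize schedule, and the function names are assumptions chosen for the example; they are not taken from the cited works.

```python
import random

def online_gradient_descent(targets, x0=0.0):
    """Play x_t, observe the loss (x_t - z_t)^2, then take a gradient step."""
    x, plays = x0, []
    for t, z in enumerate(targets, start=1):
        plays.append(x)            # commit to x_t before the loss is revealed
        grad = 2.0 * (x - z)       # gradient of (x - z)^2 at x_t
        x -= grad / (2.0 * t)      # assumed stepsize 1/(2t) for this toy loss
    return plays

def regret(plays, targets, comparator):
    """Cumulative loss of the plays minus that of one fixed predictor."""
    algo = sum((x - z) ** 2 for x, z in zip(plays, targets))
    fixed = sum((comparator - z) ** 2 for z in targets)
    return algo - fixed

random.seed(0)
targets = [random.uniform(-1.0, 1.0) for _ in range(1000)]
plays = online_gradient_descent(targets)
best_fixed = sum(targets) / len(targets)  # minimizer of the cumulative squared loss
print("average regret:", regret(plays, targets, best_fixed) / len(targets))
```

In this toy run the per-round regret is small, which is the behavior a sublinear regret bound describes for convex losses.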
Neither paradigm alone is ideal for the continual lifelong learning scenario. Finn et al. [9] therefore proposed combining them into the Online Meta-Learning framework, which will be discussed in Section 2. However, this framework relies on a strong-convexity assumption, while many problems of current interest are non-convex in nature. Thus, in Section 3, we generalize this framework to the non-convex setting. Section 4 presents an instantiation of our algorithm with rigorous theoretical proofs of its performance guarantee. Experimental results on real data are shown in Section 5. Finally, concluding remarks and takeaways are provided in Section 6. To the best of our knowledge, this is the first theoretical regret analysis for non-convex online meta-learning algorithms, shedding light on applying online meta-learning to more challenging learning problems in the paradigm of deep neural networks.
Notation.
We use bold letters to denote vectors, e.g., $\boldsymbol{u}$, $\boldsymbol{v}$. The $i$th coordinate of a vector $\boldsymbol{u}$ is $u_i$. Unless explicitly noted, we study the Euclidean space $\mathbb{R}^d$ with the inner product $\langle \cdot, \cdot \rangle$, and the Euclidean norm. We assume everywhere our objective function is bounded from below and denote the infimum by $f^\star$. The gradient of a function $f$ at $\boldsymbol{x}$ is $\nabla f(\boldsymbol{x})$. $\mathbb{E}[u]$ means the expectation w.r.t. the underlying probability distribution of a random variable $u$.
2 Background
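Algorithm 1 is the online meta-learning framework proposed in [9]. A meta-learner $\boldsymbol{w}_t$ is maintained to preserve the prior knowledge learned from past rounds. Upon seeing a new task $\mathcal{T}_t$, one is first given some training data $\mathcal{D}_t^{tr}$ for adapting $\boldsymbol{w}_t$ to the current task following some strategy $U_t$. Then the test data $\mathcal{D}_t^{ts}$ will be revealed for evaluating the performance of the adapted learner $\tilde{\boldsymbol{w}}_t = U_t(\boldsymbol{w}_t)$. The loss $f_t(\tilde{\boldsymbol{w}}_t)$ suffered at this round can then be fed into an online learning algorithm to update $\boldsymbol{w}_t$. We use $U_t(\boldsymbol{w}) = \boldsymbol{w} - \alpha \nabla \hat{f}_t(\boldsymbol{w})$ following [9], where $\hat{f}_t$ is the loss on the training data and $\alpha$ is a stepsize.
As tasks can be very different, the original regret in Equation (1) of competing with a fixed learner across all tasks becomes less meaningful. Thus, Finn et al. [9] changed it to:
$$R_T(\boldsymbol{w}) = \sum_{t=1}^{T} f_t\big(U_t(\boldsymbol{w}_t)\big) - \sum_{t=1}^{T} f_t\big(U_t(\boldsymbol{w})\big),$$
which competes with any fixed meta-learner $\boldsymbol{w}$. Under this definition, they designed the Follow the Meta Leader algorithm, which enjoys a logarithmic regret when assuming strong convexity on $f_t$.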
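The round structure described in this section (adapt the meta-learner on training data, evaluate the adapted learner on test data, then update the meta-learner) can be sketched on a toy problem. This is only an illustration under stated assumptions: each task is a one-dimensional quadratic with a training center and a test center, the adaptation is a single gradient step with stepsize `alpha`, and the meta-learner is updated by plain online gradient descent with a hypothetical stepsize `beta`. It is not a transcription of Algorithm 1.

```python
def adapt(w, c_train, alpha):
    """One gradient step on the toy training loss (w - c_train)^2."""
    return w - alpha * 2.0 * (w - c_train)

def meta_gradient(w, c_train, c_test, alpha):
    """Derivative w.r.t. w of the test loss at the adapted learner (chain rule)."""
    w_adapted = adapt(w, c_train, alpha)
    # d(adapt)/dw = 1 - 2*alpha for this quadratic training loss
    return 2.0 * (w_adapted - c_test) * (1.0 - 2.0 * alpha)

def online_meta_learning(tasks, w0=0.0, alpha=0.1, beta=0.5):
    """tasks is a sequence of (c_train, c_test) pairs revealed one at a time."""
    w, losses = w0, []
    for c_train, c_test in tasks:
        w_adapted = adapt(w, c_train, alpha)       # adapt to the current task
        losses.append((w_adapted - c_test) ** 2)   # loss suffered this round
        w -= beta * meta_gradient(w, c_train, c_test, alpha)  # update meta-learner
    return w, losses

w_final, losses = online_meta_learning([(3.0, 3.0)] * 50)
print(w_final, losses[0], losses[-1])
```

When all tasks share the same center, the meta-learner drifts toward it, so later rounds need almost no adaptation.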
3 Problem Formulation
In this section, we generalize the online meta-learning algorithm to the non-convex setting by first demonstrating the infeasibility of a regret of the form (1) and then introducing an alternative performance measure.
Finding the global minimum of a non-convex function is in general known to be NP-hard. Yet, if we could find an online learning algorithm with an $o(T)$ regret for some non-convex function classes, we could optimize any function of that class efficiently: simply run the online learning algorithm with the objective $f$ as the loss at each round, and choose a random iterate $\boldsymbol{x}_{\text{out}}$ as output. This gives us:
$$\mathbb{E}[f(\boldsymbol{x}_{\text{out}})] - \min_{\boldsymbol{x}} f(\boldsymbol{x}) = \frac{1}{T}\sum_{t=1}^{T} f(\boldsymbol{x}_t) - \min_{\boldsymbol{x}} f(\boldsymbol{x}) \le \frac{o(T)}{T} \to 0,$$
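which leads to a contradiction unless P=NP. Thus, we have to find another performance measure for the non-convex case. One potential candidate is the local regret proposed by Hazan et al. [10]:
$$\mathfrak{R}_w(T) = \sum_{t=1}^{T} \big\| \nabla F_{t,w}(\boldsymbol{x}_t) \big\|^2 \qquad (2)$$
where $F_{t,w}(\boldsymbol{x}) = \frac{1}{w}\sum_{i=0}^{w-1} f_{t-i}(\boldsymbol{x})$, $1 \le w \le T$, and $f_t(\boldsymbol{x}) = 0$ for $t \le 0$. The use of a sliding window in $F_{t,w}$, especially a large window, can be justified by Theorem 2.7 in [10].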
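To make the sliding-window idea concrete, here is a minimal sketch that computes a local regret of this kind for a given sequence of iterates. It assumes the unconstrained form in the spirit of [10]: at each round, average the gradients of the last `window` losses at the current iterate, treat losses before the first round as identically zero, and sum the squared norms over rounds. The helper names and the quadratic toy losses are hypothetical.

```python
def smoothed_gradient(grad_fns, t, x, window):
    """Gradient at x of the average of the last `window` losses up to round t.
    Rounds with a negative index contribute a zero loss."""
    total = [0.0] * len(x)
    for i in range(window):
        if t - i >= 0:
            g = grad_fns[t - i](x)
            total = [a + b for a, b in zip(total, g)]
    return [a / window for a in total]

def local_regret(grad_fns, xs, window):
    """Sum over rounds of the squared norm of the window-averaged gradient."""
    return sum(
        sum(c * c for c in smoothed_gradient(grad_fns, t, x, window))
        for t, x in enumerate(xs)
    )

def make_grad(center):
    """Gradient of the toy loss 0.5 * ||x - center||^2."""
    return lambda x: [xi - ci for xi, ci in zip(x, center)]

grad_fns = [make_grad([1.0]), make_grad([3.0])]
iterates = [[0.0], [0.0]]
print(local_regret(grad_fns, iterates, window=2))  # 0.25 + 4.0 = 4.25
```

A larger window averages over more past losses, so a single unusual round has less influence on the measured gradient norm.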
4 Algorithm & Theoretical Guarantees
4.1 Stochasticity of Online Meta-learning Algorithms
In practice, the test data revealed at each round is typically just a randomly sampled batch of the whole test set; the losses and gradients obtained at each round are thus (unbiased) estimates of the true ones. This is the stochastic setting, which we formalize by making the following assumptions.
Assumption 1.
We assume that at each round , each call to any stochastic gradient oracle , , yields an i.i.d. random vector with the following properties:


;

;

Mutual independence: for ,
where , and denotes the conditional expectation of with respect to . Also note that for .
Hazan et al. proposed a time-smoothed online gradient descent algorithm [10] for such a case. Yet, that algorithm’s performance critically relies on the choice of the stepsize, and it may even diverge if the stepsize is too large relative to the (often unknown) smoothness constant of the loss function. We thus propose to use the AdaGrad-Norm [11] algorithm (Algorithm 2) as the online learning algorithm in Algorithm 1 instead. Here, $b_0 > 0$ is the initialization of the accumulated squared norms and prevents division by 0, while $\eta > 0$ is to ensure homogeneity and that the units match.
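For intuition, an AdaGrad-Norm style update can be sketched as follows, based on our reading of [11]: a single scalar accumulates the squared gradient norms, and each step is scaled by the inverse square root of that accumulator. The names `eta` and `b0`, their default values, and the quadratic test function are assumptions for this example; the sketch uses exact gradients and is not a transcription of Algorithm 2.

```python
import math

def adagrad_norm(grad_fn, x0, steps, eta=1.0, b0=1e-2):
    """Gradient descent with one shared, norm-based adaptive stepsize."""
    x = list(x0)
    b_sq = b0 ** 2                        # accumulated squared gradient norms
    for _ in range(steps):
        g = grad_fn(x)
        b_sq += sum(c * c for c in g)     # accumulate before taking the step
        step = eta / math.sqrt(b_sq)      # b0 > 0 keeps the denominator positive
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, math.sqrt(b_sq)

# Toy objective 5 * ||x||^2 with gradient 10 * x (smoothness constant 10).
x_final, b_final = adagrad_norm(lambda v: [10.0 * c for c in v], [1.0, -2.0], steps=200)
print(x_final, b_final)
```

On this toy objective a fixed stepsize of 1 would diverge, whereas the accumulated norm quickly shrinks the effective stepsize and the iterates settle near the minimizer.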
4.2 Convergence Analysis
We present below an analysis of this algorithm assuming the loss function satisfies:
Assumption 2.
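The loss function $f$ is twice differentiable and, for all $\boldsymbol{x}, \boldsymbol{y}$, is:
(a) $G$-Lipschitz: $|f(\boldsymbol{x}) - f(\boldsymbol{y})| \le G \|\boldsymbol{x} - \boldsymbol{y}\|$.
(b) $L$-smooth: $\|\nabla f(\boldsymbol{x}) - \nabla f(\boldsymbol{y})\| \le L \|\boldsymbol{x} - \boldsymbol{y}\|$. Note that this implies [12, Lemma 1.2.3]:
$$\big| f(\boldsymbol{y}) - f(\boldsymbol{x}) - \langle \nabla f(\boldsymbol{x}), \boldsymbol{y} - \boldsymbol{x} \rangle \big| \le \frac{L}{2} \|\boldsymbol{y} - \boldsymbol{x}\|^2 \qquad (3)$$
(c) $\rho$-Hessian-Lipschitz: $\|\nabla^2 f(\boldsymbol{x}) - \nabla^2 f(\boldsymbol{y})\| \le \rho \|\boldsymbol{x} - \boldsymbol{y}\|$.
(d) $M$-Bounded: $|f(\boldsymbol{x})| \le M$.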
Assuming Assumption 2 of , we can derive the following properties of (the proof can be found in the Appendix):
Lemma 1.
Assuming Assumption 2, is Bounded, Lipschitz, and smooth.
The following theorem shows that by selecting , a logarithmic regret of the algorithm is guaranteed w.r.t. any .
Theorem 1.
Before showing the proof of Theorem 1, we need the following technical lemmas whose proofs can be found in the Appendix. For simplicity, we denote as condition on and take expectation w.r.t. :
Lemma 2.
As , and , Assumption 1 gives us:



Lemma 3.
Given Assumption 2(d), we have: .
Lemma 4 ([13], Lemma 9).
Let be a nonincreasing function, and for . Then
Proof of Theorem 1.
The proof follows that of Theorem 2.1 in [11].
First, as the average of smooth functions, is also smooth. Using the property in Assumption 2(b) and the update formula (Line 5) in Algorithm 2 we have:
Denote , and take expectation w.r.t. conditioned on (namely ) :
(4)  
(5)  
(6) 
Second, from the definition of and we have:
Using this, and Jensen’s inequality on which is a convex function, we can upper-bound Equation (5) by its absolute value, which in turn can be upper-bounded by:
(7)  
(8) 
Third, by using inequality with , , Equation (7) can be upper bounded by:
where we used that holds for .
Applying again but with , , Equation (8) can be upperbounded by:
Fourth, plugging the above two inequalities back, and in turn plugging the result back into Equation (5), gives us:
Rearrange terms, then for both sides, take expectation w.r.t. and sum from to :
(9)  
(10)  
(11) 
As , letting be in Lemma 4 gives us:
where we used Jensen’s inequality for which is a concave function in .
Since each is Lipschitz, so is , thus, using the Cauchy-Schwarz inequality:
(12)  
Putting the above inequality back into Equation (11) and Lemma 3 back into Equation (10), we have:
(13)  
(14) 
Finally, using Markov’s inequality, with probability :
Denote . Using similar derivation in Equation (12), with probability we have:
This means, with probability , we have:
Denote Equation (14) as , and use Markov’s inequality again we have, with probability :
Therefore, with probability :
By solving the above "quadratic" inequality of and letting , we arrive at the end.
∎
5 Experiment
We evaluated our algorithm on the few-shot image classification task of the Omniglot [14] dataset, which consists of 20 instances of each of 1623 characters from 50 different alphabets. The dataset is augmented with rotations by multiples of 90 degrees following [15].
We employed the $N$-way $K$-shot protocol [7]: at each round, pick $N$ unseen characters irrespective of alphabets. Provide the meta-learner with $K$ different drawings of each of the $N$ characters as the training set, then evaluate the adapted model’s ability on new unseen instances within the $N$ classes (namely the test set). We chose the 5-way 5-shot scheme, and used 15 samples per character for testing following [16].
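The episode construction described above can be sketched as follows. This is a minimal illustration on a stand-in dataset of string identifiers, not the actual data pipeline: `sample_episode`, its parameter names, and the toy `dataset` are hypothetical, and the sketch does not track which characters were already used in earlier rounds.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Draw one few-shot task. `dataset` maps a class name to a list of examples."""
    classes = rng.sample(sorted(dataset), n_way)          # pick the task's classes
    train, test = [], []
    for label, name in enumerate(classes):
        examples = rng.sample(dataset[name], k_shot + n_query)  # no repeats
        train += [(x, label) for x in examples[:k_shot]]  # training (support) set
        test += [(x, label) for x in examples[k_shot:]]   # test (query) set
    return train, test

# Stand-in data: 30 classes with 20 instances each, as identifiers only.
dataset = {f"char{c}": [f"char{c}_img{i}" for i in range(20)] for c in range(30)}
train, test = sample_episode(dataset, n_way=5, k_shot=5, n_query=15, rng=random.Random(0))
print(len(train), len(test))  # 25 75
```

With 5 training and 15 test samples per class, each class contributes exactly the 20 instances available per character.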
The model we used is a CNN following [7]. It contains 4 modules, each of which is a 3×3 convolution with 64 filters followed by batch normalization [17], a ReLU nonlinearity and 2×2 max-pooling. Images are downsampled to 28×28 so that the resulting feature map of the last hidden layer is 1×1×64. The last layer is fed into a fully connected layer and the loss we used is the cross-entropy loss.
To study whether our algorithm provides any empirical benefit over traditional methods, we compare it to two benchmark algorithms [9]: Train on Everything (TOE), and Train from Scratch (TFS). On each round, both initialize a new model. The difference is that TOE trains over all available data, both training and testing, from all past tasks, plus the training data of the current round, while TFS only uses the training data of the current round for training.
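As a quick check on the CNN architecture described earlier in this section, the spatial size of the feature map after each of the 4 modules can be computed with the standard output-size formulas. The padding of 1 and stride of 1 for the convolution, and the stride of 2 with floor rounding for the pooling, are assumptions here, since the text does not state them.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output width of a convolution (assumed padding 1, stride 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output width of max-pooling (assumed stride 2, floor rounding)."""
    return (size - kernel) // stride + 1

def feature_map_sizes(size=28, modules=4):
    """Spatial size of the input followed by the size after each module."""
    sizes = [size]
    for _ in range(modules):
        size = pool_out(conv_out(size))
        sizes.append(size)
    return sizes

print(feature_map_sizes())  # [28, 14, 7, 3, 1]
```

Under these assumptions a 28×28 input shrinks to a 1×1 map after the fourth module, so with 64 filters the last hidden layer holds 64 features.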
The experiments are performed in PyTorch [18], and parameters are left at their defaults unless otherwise specified. For the stepsize in the local adaptation strategy in Algorithm 1, we set it to 0.1 everywhere, and the gradient descent step is performed only once for each task. For the AdaGrad-Norm algorithm (Algorithm 2), we set its hyperparameters as suggested in the original paper [11]. TFS and TOE used Adam [19] with default parameters.
The result is shown in Figure 1, which suggests that our algorithm gradually accumulates prior knowledge, which enables fast learning of later tasks. TFS provides a good example of how a CNN performs when the training data is scarce. On the contrary, TOE behaves nearly as random guessing. The inferiority of TOE to TFS is somewhat surprising, as TOE has much more training data than TFS. The reason is that TOE regards all training data as coming from a single distribution, and tries to learn a model that works for all tasks. Thus, when tasks are substantially different from each other, TOE might even incur negative transfer and fail to solve any single task, as has been observed in [20]. Meanwhile, by using training data of the current task only, TFS avoids negative transfer, but also rules out learning of any connection between tasks. Our algorithm, in contrast, is designed to discover common structures across tasks, and use this information to guide fast adaptation to new tasks.
6 Conclusion
The continual lifelong learning problem is common in real life, where an agent needs to accumulate knowledge from every task it encounters, and utilize that knowledge for fast learning of new tasks. To solve this problem, we can combine the meta-learning and the online-learning paradigms to form the online meta-learning framework. In this work, we generalized this framework to the non-convex setting, and introduced the local regret to replace the original regret definition. We applied it to the stochastic setting, and showed its superiority both in theory and practice. In future work, we would like to evaluate our algorithm on harder learning problems over larger-scale datasets such as ImageNet.
References
[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436, 2015.
[2] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484, 2016.
[3] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[4] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in Proceedings of International Conference on Machine Learning, 2016, pp. 173–182.
[5] Devang K Naik and RJ Mammone, “Meta-neural networks that learn by learning,” in Proceedings of International Joint Conference on Neural Networks. IEEE, 1992, vol. 1, pp. 437–442.
[6] Sebastian Thrun and Lorien Pratt, Learning to learn, Springer Science & Business Media, 2012.
[7] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., “Matching networks for one shot learning,” in Proceedings of Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
[8] Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, learning, and games, Cambridge University Press, 2006.
[9] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine, “Online meta-learning,” in Proceedings of International Conference on Machine Learning, 2019, pp. 1920–1930.
[10] Elad E Hazan, Karan Singh, and Cyril Zhang, “Efficient regret minimization in non-convex games,” in Proceedings of International Conference on Machine Learning, 2017, pp. 2278–2288.
[11] Rachel Ward, Xiaoxia Wu, and Leon Bottou, “Adagrad stepsizes: sharp convergence over nonconvex landscapes,” in Proceedings of International Conference on Machine Learning, 2019, pp. 6677–6686.
[12] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer, 2003.
[13] Xiaoyu Li and Francesco Orabona, “On the convergence of stochastic gradient descent with adaptive stepsizes,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 983–992.
[14] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
[15] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap, “Meta-learning with memory-augmented neural networks,” in Proceedings of International Conference on Machine Learning, 2016, pp. 1842–1850.
[16] Sachin Ravi and Hugo Larochelle, “Optimization as a model for few-shot learning,” in International Conference on Learning Representations, 2017.
[17] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of International Conference on Machine Learning, 2015, pp. 448–456.
[18] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” 2017.
[19] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2014.
[20] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov, “Actor-mimic: Deep multitask and transfer reinforcement learning,” in International Conference on Learning Representations, 2016.
Appendix A
A.1 Proof of Lemma 1
Lemma 1. Assuming Assumption 2, is Bounded, Lipschitz, and smooth.
Proof.
We first write out the complete formula of :
The Boundedness is straightforward.
To show the Lipschitzness, we derive :
Note that and both share the properties of , thus, from Assumption 2(a,b), we have:
Next, denoting as , we have :
where the first inequality uses the triangle inequality of a norm; the second inequality uses the smoothness and Hessian-Lipschitzness assumptions; the third inequality uses the smoothness assumption.
We are left to prove the last inequality:
where the first inequality uses the triangle inequality of a norm, and the second inequality uses the smoothness assumption. ∎
A.2 Proof of Lemma 2
Lemma 2. As , and , Assumption 1 gives us:



Proof.
Note that denotes conditioning on and taking the expectation w.r.t. .
In Assumption 1(a) we assume for ; the linearity of expectation then immediately gives us .
To see the second part, we only need to expand as:
Each item of the first part in the last equation can be bounded by according to Assumption 1(b), which leads to a overall upperbound.
For the second part, we need to use the Mutual Independence assumption (namely Assumption 1(c)):
Using Assumption 1(a) again, we know that the above expression equals 0. This proves part (b) of this lemma. ∎
A.3 Proof of Lemma 3
Lemma 3. Given Assumption 2(d), we have: .