1 Introduction
In supervised learning, we learn to map features to targets by minimizing a statistical loss averaged over samples from an unknown distribution, which is typically associated with a single task Learned-Miller (2011). When this map is a universal function approximator, e.g., a deep neural network (DNN), this framework has yielded successes across a variety of applications
Yin et al. (2017); Gopalakrishnan et al. (2017); Du et al. (2017); Pan et al. (2012). However, its successes have been limited when data is comprised of several qualitatively different regimes, or tasks. To enhance adaptivity to disparate tasks, meta-learning seeks to obtain model parameters along the Pareto frontier of the minimizers of many training objectives simultaneously Andrychowicz et al. (2016), and has gained attention for overcoming data starvation issues in robotics and physical systems Finn et al. (2017). Existing approaches, however, offer little guidance about how to select samples on which to train to enable fast convergence, and instead operate via cyclic or random sampling. Doing so is appropriate when disparate tasks are statistically independent. However, in many contexts such as meteorology Racah et al. (2017)
, computer vision, and robotics
Finn et al. (2017), significant relationships between tasks exist. We are then faced with the question of how to incorporate such relationships into the training of a meta-model. In this work, we do so via active sample selection during the training of meta-models. This active sample selection is executed according to correlation within and across tasks via schedulers based on multi-armed bandits (MAB) Lattimore and Szepesvári (2020) and Markov Decision Processes (MDPs) Puterman (2014), which yields substantial gains in sample efficiency across a variety of experimental settings. Before continuing, a few historical remarks are in order. Augmenting DNN training to improve adaptivity has received substantial interest over the years. Transfer learning relaxes the independent and identically distributed (i.i.d.) hypothesis on data, and seeks to transform a model good for one task to another (domain adaptation)
Tan et al. (2018); Dai et al. (2007), i.e., transfer an understanding of Spanish to Italian Dai et al. (2007). Generative modeling, by contrast, directly estimates the data distribution in order to output new examples that plausibly could have been drawn from the original data, similar in spirit to bootstrapping. Recent advances in parameterizing these models using deep neural networks have enabled scalable modeling of complex, high-dimensional data
Shorten and Khoshgoftaar (2019). Both approaches are effective for transferring from one task to another, but it is unclear how to employ them when seeking generalization across many tasks, unless the generative/covariance model co-evolves with data drift, which may cause instability Radford et al. (2015). By contrast, meta-learning seeks to learn attributes of a problem class which are common to many distinct domains, and has been observed to improve adaptability by explicitly optimizing few-shot generalization across a set of meta-training tasks Wang et al. (2019). Importantly, doing so enables learning of a new task with as little as a single example Yu et al. (2018); Yin et al. (2019). Meta-learning algorithms can be framed in terms of a cost that ties together many training subtasks simultaneously, with, for instance, recurrent or attention-based models, or an otherwise two-stage objective Liu and Vicente (2019): the inner cost defines performance on a single task, and the outer meta-objective tethers performance across tasks. Doing so results in procedures that experimentally have yielded substantial gains in terms of DNN adaptation and generalization to new tasks Rajeswaran et al. (2019).
The aforementioned works, as well as other meta-learning objectives, operate under the assumption that training samples are i.i.d. to justify sampling cyclically or randomly. This assumption is invalid for settings involving drift or latent relationships between classes, such as training an NLP system for both Spanish and Italian Peters et al. (2019), image classification of animals from a common genus Wang et al. (2018), or system identification problems arising in ground robotics when traversing prairie and forest floor Koppel et al. (2016); Chiuso and Pillonetto (2019). Thus, in this work, we propose to build a scheduler on top of the meta-learner (Figure 1) to exploit relationships between meta-training data subsets and allocate samples judiciously.
To do so, we incorporate ideas from active learning
Cohn et al. (1996), specifically, selecting a given meta-learning training subset according to either a multi-armed bandit Auer et al. (2002b) or a Markov decision process (MDP) Bellman (1957). Which technique is appropriate depends on whether the statistical accuracy of one task is allowed to be correlated with another. In either case, the state is the weights of a meta-learning model, the arm (action) is the index of the specific training task or class label, and the reward is the statistical accuracy of the meta-model on a validation set multiplied by a scaling factor to ensure the reward is stationary. Moreover, the regret of a given arm is the scaled average long-run validation accuracy on that meta-training subset. Experimentally, we observe the merit of bandit selections when we employ the Upper Confidence Bound (UCB) or Gittins Index, and MDP policies based upon a linear programming solver De Farias and Van Roy (2003), for meta-training DNNs. In particular, we obtain orders of magnitude improvement in sample complexity when employing our sample selection schemes relative to cyclic or random sampling (Table 1
) for training feedforward multilayer DNNs and convolutional variants on MNIST
Lecun et al. (1998), the real-world Extreme Weather dataset Racah et al. (2017), and a meta-learning variant of CIFAR-100 Krizhevsky (2012). On top of sample efficiency gains, the order of sample selection experimentally can fundamentally improve the limit points to which the meta-model converges.

Table 1: Sample-efficiency gains of each scheduler relative to cyclic sampling.

                    UCB Scheduler   Gittins Index Scheduler   MDP Scheduler
Digit Recognition   24.5            32.5                      /
Meta-CIFAR100       2.5             3.57                      /
Extreme Weather     1.25            2.42                      3.33
2 Elements of Meta-Learning
In supervised learning, we seek to build a predictor $f_\theta : \mathcal{X} \rightarrow \mathcal{Y}$ which maps feature vectors $x \in \mathcal{X} \subseteq \mathbb{R}^p$ to target variables $y \in \mathcal{Y}$ by minimizing a loss function $\ell(f_\theta(x), y)$ in expectation over the data distribution, which is unknown. Here $\theta \in \mathbb{R}^d$ denotes the parameters of the statistical model $f_\theta$ (such as a feedforward or convolutional neural network). The loss $\ell$ quantifies the difference between a candidate prediction $f_\theta(x)$ at an input vector $x$ and a target variable $y$, and is small when $f_\theta(x)$ and $y$ are close. For concreteness and clarity, we focus on the case of multi-class classification, an instance of supervised learning, although the ideas developed in this work are also applicable to unsupervised and reinforcement learning. Thus, the space of target variables is of the form $\mathcal{Y} = \{1, \dots, C\}$, where $C$ is the number of classes. In this context, we wish to compute the parameters $\theta$ that minimize the statistical loss $\mathbb{E}_{(x,y)}[\ell(f_\theta(x), y)]$, where the expectation is over the data distribution. In practice, one is given a batch of data $\mathcal{D}$, which may be associated with any number of unknown distributions colloquially referred to as tasks. In particular, we have access to $T$ distinct training subsets $\{\mathcal{D}_t\}_{t=1}^T$ whose union is $\mathcal{D}$, and we would like to find a model that simultaneously performs well on each:
$$\min_{\theta} \; L_t(\theta) := \frac{1}{|\mathcal{D}_t|} \sum_{(x_n, y_n) \in \mathcal{D}_t} \ell\big(f_\theta(x_n), y_n\big) \quad \text{for all } t = 1, \dots, T. \qquad (1)$$
We consider that each meta-learning sample subset $\mathcal{D}_t$ is split into a training set and a validation set, i.e., $\mathcal{D}_t = \mathcal{T}_t \cup \mathcal{V}_t$ with $\mathcal{T}_t \cap \mathcal{V}_t = \emptyset$, and that the training subsets $\mathcal{T}_t$ for all $t$ are used for training within tasks, whereas the validation sets are used across tasks. Moreover, we denote $\mathcal{T} = \cup_t \mathcal{T}_t$ and $\mathcal{V} = \cup_t \mathcal{V}_t$.^1  ^1 For disambiguation, we denote samples of $\mathcal{T}_t$ as $(x_n^t, y_n^t)$ for $n = 1, \dots, N_t$. Moreover, we denote $N_t$ as the number of training examples available for task $t$. Throughout, to further alleviate notation, we suppress the dependence of an example on its class $c$, and instead leave this dependence implicit. Then, we hypothesize that the statistical model
$f_\theta$ depends on a vector of hyperparameters $\lambda$, such as the regularizer, the radius of a pooling step in a convolutional neural network, or other architectural considerations. One way to pose the problem of meta-learning is as a two-stage optimization variant of (1):

$$\min_{\lambda} \; \sum_{t=1}^{T} \frac{1}{|\mathcal{V}_t|} \sum_{(x_n, y_n) \in \mathcal{V}_t} g\big(f_{\theta_t^\star(\lambda)}(x_n), y_n\big) \quad \text{s.t.} \quad \theta_t^\star(\lambda) \in \operatorname*{argmin}_{\theta} L_t(\theta; \lambda) \ \text{ for all } t, \qquad (2)$$
where $g$ is again some cost, possibly equal to $\ell$, which is small when the prediction and target are close. This formulation yields models which both perform well on individual tasks as quantified by the inner cost $\ell$ and across tasks through seeking to minimize the outer cost $g$ for all tasks simultaneously. That is, model selection of $\theta_t^\star(\lambda)$ according to (2) at the inner stage (the constraint evaluation) is decoupled across tasks, whereas at the outer stage, the objective is coupled by the hyperparameters $\lambda$. For connections to bilevel optimization, see Franceschi et al. (2018); Likhosherstov et al. (2020).
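To make the train/validation split notation concrete, here is a minimal sketch of dividing each meta-learning subset $\mathcal{D}_t$ into a training part $\mathcal{T}_t$ and a validation part $\mathcal{V}_t$; the function name, the validation fraction, and the toy data are our own illustrations, not the paper's implementation.

```python
import random

def split_task(samples, val_fraction=0.2, seed=0):
    """Split one meta-learning subset D_t into a training set T_t and a validation set V_t."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    # First n_val examples become V_t, the remainder become T_t.
    return shuffled[n_val:], shuffled[:n_val]

# Example: T = 3 tasks, each a list of (feature, label) pairs.
tasks = [[(i, i % 2) for i in range(10)] for _ in range(3)]
splits = [split_task(d) for d in tasks]  # list of (T_t, V_t) pairs
```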
Given that computing the simultaneous minimizer of a number of different nonconvex functions is intractable, one may hypothesize that the universal quantifier over tasks $t$ in (2) may be replaced by the sum of costs
$$\theta^\star(\lambda) \in \operatorname*{argmin}_{\theta} \; \sum_{t=1}^{T} L_t(\theta; \lambda), \qquad (3)$$
which presupposes that tasks and classes are statistically independent. Then, because exactly solving the inner optimization problem, i.e., the constraint in (2), is both intractable numerically when $f_\theta$ is a neural network (as the problem becomes nonconvex) and may lead to solutions that over-prioritize a single task (overfit), one may consider the computational approximation of (2) as Finn et al. (2017)
$$\min_{\lambda} \; \sum_{t=1}^{T} \frac{1}{|\mathcal{V}_t|} \sum_{(x_n, y_n) \in \mathcal{V}_t} g\big(f_{\theta_t(\lambda)}(x_n), y_n\big) \quad \text{s.t.} \quad \theta_t(\lambda) = \theta_0 - \alpha \nabla_\theta L_t(\theta_0; \lambda) \ \text{ for all } t, \qquad (4)$$
Note that the argmin in the constraint of (2) has been substituted in (4) by a single gradient step: we seek model parameters close to the fixed point of the gradient of the task-specific objective Finn et al. (2017), while also minimizing the cost $g$ which is defined across tasks. Here the hyperparameters $\lambda = (\alpha, \theta_0)$ collect the inner step-size and the initialization. The spirit of (4) is that we seek model parameters that perform well after a few gradient steps on an unseen task, whereas (1) yields solutions that perform well on average after observing a number of samples from a common distribution. Prevailing practice in meta-learning is built upon assuming statistical independence between tasks and classes, which permits grouping the inner and outer expectations – see Fallah et al. (2020).
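The gradient-step approximation in (4) can be illustrated on a toy problem. The sketch below uses one-dimensional quadratic task losses and a finite-difference meta-gradient, both our own illustrative choices; with these quadratics, the optimal initialization is the mean of the task minimizers.

```python
# Toy 1-D instance of the inner gradient step in (4), with synthetic task
# losses L_t(w) = 0.5 * (w - m_t)^2 (m_t is the minimizer of task t).
def inner_update(theta0, alpha, m_t):
    grad = theta0 - m_t              # gradient of L_t at the initialization
    return theta0 - alpha * grad     # one SGD step adapted to task t

def meta_loss(theta0, alpha, task_means):
    # Outer objective: sum of task losses evaluated after the inner step.
    return sum(0.5 * (inner_update(theta0, alpha, m) - m) ** 2 for m in task_means)

# Gradient descent on the initialization theta0 (part of the hyperparameters).
theta0, alpha, beta = 0.0, 0.5, 0.1
means = [1.0, -1.0, 3.0]
for _ in range(200):
    eps = 1e-5  # finite-difference approximation of the meta-gradient
    g = (meta_loss(theta0 + eps, alpha, means)
         - meta_loss(theta0 - eps, alpha, means)) / (2 * eps)
    theta0 -= beta * g
# theta0 converges to the mean of the task minimizers, here 1.0.
```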
Main Results In this work, we move beyond the hypothesis that tasks and classes are independent by considering a generalization of (4): rather than focusing on the aggregate task-specific cost $\sum_t L_t$, we retain the task-specific model fitness in the constraint for each class,
$$\min_{\lambda} \; \sum_{t=1}^{T} \frac{1}{|\mathcal{V}_t|} \sum_{(x_n, y_n) \in \mathcal{V}_t} g\big(f_{\theta_t(\lambda)}(x_n), y_n\big) \quad \text{s.t.} \quad \theta_t(\lambda) = \theta_0 - \alpha \nabla_\theta L_{t,c}(\theta_0; \lambda) \ \text{ for all } c = 1, \dots, C_t, \ t = 1, \dots, T, \qquad (5)$$

where $L_{t,c}$ denotes the loss of task $t$ restricted to examples of class $c$.
This formulation reveals the question of how to compute a point at the intersection of a set of constraints, one for each of the $C_t$ classes of each task, when the satisfaction of one constraint influences another. In this work, we focus on sequential approaches to addressing this question, inspired by active learning Cohn et al. (1996); Settles (2011). In particular, we develop techniques to select among the $T$ different tasks and their classes the one on which to execute a training step at any given time, such that the overall meta-learning performance is optimized expeditiously. Doing so yields significant gains in sample efficiency in training meta-learners across a variety of experimental contexts, as we demonstrate in Sec. 4 – see Table 1. Next, we shift to the technical development of bandits and MDPs to this end.
3 Active Sample Selection
In meta-learning (5), there are two intertwined challenges. First, to enforce the constraint, one requires access to training examples for each task and class in order to evaluate the gradient of the different task-specific objectives with respect to the model parameters $\theta$ for fixed hyperparameters $\lambda$. With access to $\mathcal{T}_t$ for each task, a stochastic gradient update with step-size $\alpha$ is performed:
$$\theta_t \leftarrow \theta_t - \frac{\alpha}{B} \sum_{n \in \mathcal{B}_t} \nabla_\theta \ell\big(f_{\theta_t}(x_n^t), y_n^t\big), \qquad (6)$$
where $B = |\mathcal{B}_t|$ is some minibatch size, which makes (6) a stochastic gradient step (for $B < N_t$), and we have suppressed dependence on $\lambda$ for succinctness. Existing approaches proceed to execute training steps on all tasks and classes cyclically, meaning there are $\sum_t C_t$ total updates of the form (6) – see Andrychowicz et al. (2016); Finn et al. (2017). Then, we conduct a stochastic gradient update of step-size $\beta$ with respect to the meta-model:
$$\lambda \leftarrow \lambda - \beta \nabla_\lambda \sum_{t=1}^{T} \frac{1}{|\mathcal{V}_t|} \sum_{(x_n, y_n) \in \mathcal{V}_t} g\big(f_{\theta_t(\lambda)}(x_n), y_n\big). \qquad (7)$$
For simplicity, we consider that samples are chosen from the validation set $\mathcal{V}$ to execute a meta-model update in (7).
One way of going beyond statistical independence between tasks in the updates is by using second-order information Im et al. (2019); Song et al. (2019); Park and Oliva (2019); however, when computing the Hessian of the Lagrangian of (5), its statistical properties are only locally (not globally) informative due to nonconvexity – see Nocedal and Wright (2006). Instead, we directly exploit covariates within and between tasks. While related ideas have been proposed for how to weight the gradient of the meta-objective in Cai et al. (2020); Simon et al. (2020); Nicholas et al. (2020), none have augmented the update rule both within a task and across tasks.
To do so, we model dependencies within each task and dependencies across different tasks as, respectively, a multi-armed bandit (MAB) and a Markov Decision Process (MDP). Before defining their specific use in modeling dependencies to more effectively schedule on which task one should perform an inner-loop update at a given time, we present the generic procedure for concreteness as Algorithm 1, which is depicted graphically in Figure 1. It involves a MAB/MDP scheduler followed by the within-task and cross-task SGD optimization. Next, we define in detail the Scheduler called in Algorithm 1.
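The generic procedure of Algorithm 1 may be sketched as the following training loop; the class and function names are illustrative placeholders, and a trivial round-robin scheduler stands in for the MAB/MDP schedulers developed below.

```python
# Schematic rendering of the scheduler-driven meta-training loop (Algorithm 1):
# the scheduler picks a task/class, an inner SGD step (6) is executed, the
# resulting reward is fed back, and a cross-task step (7) follows.
class RoundRobinScheduler:
    """Placeholder scheduler; MAB/MDP schedulers implement the same interface."""
    def __init__(self, num_tasks):
        self.num_tasks, self.i = num_tasks, 0

    def select(self):
        t, self.i = self.i % self.num_tasks, self.i + 1
        return t

    def update(self, task, reward):
        pass  # bandit/MDP schedulers update their statistics here

def meta_train(num_tasks, num_rounds, scheduler, inner_step, outer_step):
    history = []
    for _ in range(num_rounds):
        t = scheduler.select()        # which task/class to train on next
        reward = inner_step(t)        # within-task SGD step; returns scaled accuracy
        scheduler.update(t, reward)   # feed the reward back to the scheduler
        outer_step()                  # cross-task hyperparameter step
        history.append(t)
    return history

hist = meta_train(3, 6, RoundRobinScheduler(3),
                  inner_step=lambda t: 1.0, outer_step=lambda: None)
```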
3.1 Multi-Armed Bandit Scheduling of Subsets
The multi-armed bandit (MAB) encapsulates the setting where we seek to exploit covariates within a task, e.g., how one class is correlated with another. In a MAB, at each time $i$, a player (scheduler) selects one among $K$ available arms, denoted as $a_i \in \{1, \dots, K\}$, after which a reward $r_i(a_i)$ is revealed Lattimore and Szepesvári (2020). Since rewards are observed sequentially, under the setting that the underlying generating process of the rewards is stationary, the optimal selection is the one that performs best in hindsight, i.e., $a^\star \in \operatorname{argmax}_a \mathbb{E}[r(a)]$. The performance of any sequential selection strategy for $a_i$ may be quantified as the expected suboptimality, or regret $\text{Reg}_n$, defined as
$$\text{Reg}_n = \sum_{i=1}^{n} \mathbb{E}\big[ r(a^\star) - r(a_i) \big]. \qquad (8)$$
Strategies whose time-average regret $\text{Reg}_n / n$ approaches null as the time horizon $n$ becomes large are called no-regret. We consider two widely used no-regret MAB algorithms, the Upper Confidence Bound (UCB) Lai and Robbins (1985); Agrawal (1995); Auer et al. (2002a) and Gittins Indices Gittins (1979); Gittins et al. (2011), due to both their simplicity and the fact that they operate upon fairly different principles. Before describing how the arm is selected by these algorithms, we identify how the structural attributes of MABs are well suited to active sampling for meta-models.
In meta-learning, for multi-class classification with $C_t$ classes for task $t$, the $K = C_t$ possible arms are the classes, i.e., $a \in \{1, \dots, C_t\}$, and the arm pulled at a given time is a class $c$, meaning that one executes an SGD step (6) associated with class $c$. An open question is then how to define the reward $r_i(c)$. One possibility is the statistical accuracy on the validation set $\mathcal{V}_t$:
$$r_i(c) = \frac{1}{|\mathcal{V}_t|} \sum_{(x_n, y_n) \in \mathcal{V}_t} \mathbb{1}\big\{ f_{\theta_i}(x_n) = y_n \big\}, \qquad (9)$$
where the indicator $\mathbb{1}\{\cdot\}$ is one when the model classifies example $(x_n, y_n)$ correctly and null otherwise. Observe, however, that as the model and hyperparameters evolve during training, the reward will drift as the validation accuracy improves, which invalidates the stationarity hypothesis (that the reward distribution in (8) is stationary) underlying the guarantees of UCB and Gittins indices.
To ameliorate this issue, we use the fact that SGD and its first-order variants (such as Adam) on nonconvex problems exhibit a convergence rate to a first-order stationary point in terms of attenuation of the gradient norm Bottou et al. (2018)[Sec. 4.3]. Then, based upon the hypothesis that the rates of attenuation of the gradient norm and the statistical error are comparable, the ratio of the validation accuracy (9) to the gradient norm should be approximately constant during training. Thus, we define the reward as
$$r_i(c) = \frac{1}{\|\nabla_\theta L_t(\theta_i)\|} \cdot \frac{1}{|\mathcal{V}_t|} \sum_{(x_n, y_n) \in \mathcal{V}_t} \mathbb{1}\big\{ f_{\theta_i}(x_n) = y_n \big\}. \qquad (10)$$
Figure 2 shows the errors of some classes in a sample meta-training subset over the initial training steps in our MNIST experiment (elaborated upon in Section 4). Observe that the reward (10) of each state is approximately constant over time, which provides evidence to support our hypothesis, and thus substantiates our choice of reward for linking class selection among performance on training subsets with the meta-learning validation objective [cf. (5)]. The values of the reward may increase late in training, since the model parameters may settle to a local minimum while the error saturates. This is not a problem, however, as later selections influence regret less due to the accumulating sum over time in the regret (8). This decrease in importance of later decisions may further be enforced through the discounting that arises in UCB, Gittins Indices, and MDPs, as described next.
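One way to compute the gradient-norm-scaled reward of (9)-(10) is sketched below; the exact normalization is our reading of the text, and all names are illustrative.

```python
# Sketch of the scaled reward: validation accuracy (9) divided by the
# gradient norm, so the reward stays roughly stationary as both quantities
# evolve during training.
import math

def validation_accuracy(predict, val_set):
    # Empirical accuracy (9) on the validation set V_t.
    return sum(1 for x, y in val_set if predict(x) == y) / len(val_set)

def scaled_reward(predict, val_set, grad):
    grad_norm = math.sqrt(sum(g * g for g in grad))
    # Guard against a vanishing gradient norm near a stationary point.
    return validation_accuracy(predict, val_set) / max(grad_norm, 1e-12)

# Toy validation set and a perfect parity classifier (synthetic example).
val = [(0, 0), (1, 1), (2, 0), (3, 1)]
r = scaled_reward(lambda x: x % 2, val, grad=[0.3, 0.4])  # accuracy 1.0, norm 0.5
```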
Upper Confidence Bound
The Upper Confidence Bound (UCB) operates upon the principle of optimism in the face of uncertainty. Specifically, we initialize the model associated with task $t$ via a single iteration of (6) on each class. Then, we count the number of times class $c$ has been chosen up to time $i$ as $n_i(c)$ for each $c \in \{1, \dots, C_t\}$, i.e., $n_i(c) = \sum_{j \le i} \mathbb{1}\{c_j = c\}$, and its associated average reward $\bar{r}_i(c) = \frac{1}{n_i(c)} \sum_{j \le i \,:\, c_j = c} r_j(c)$.
Then, UCB selection operates via a calibrated perturbation from the sample mean of the reward as
$$c_{i+1} \in \operatorname*{argmax}_{c} \; \bar{r}_i(c) + \xi \sqrt{\frac{2 \log i}{n_i(c)}},$$
where $\xi > 0$ is a constant that encourages exploration. This procedure is repeated for $n$ total steps, and achieves regret that is logarithmic in the total number of steps $n$, which is precisely the within-task minibatch size – see Lai and Robbins (1985). We set the exploration factor $\xi$ to a fixed constant. For each hyperparameter update of $\lambda$, a batch of samples is selected from $\mathcal{T}_t$ according to those classes which maximize the upper confidence bound, as determined by Algorithm 2. Then, these samples are used to update the hyperparameters w.r.t. the validation loss in (7).
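A minimal version of the UCB selection rule (Algorithm 2) on a synthetic three-armed problem is sketched below; we use the standard UCB1 exploration bonus, which may differ in constants from the authors' variant, and the reward means are synthetic.

```python
# UCB1-style arm selection: play each arm once, then pick the arm maximizing
# the sample-mean reward plus a confidence bonus.
import math, random

def ucb_select(counts, means, i, xi=1.0):
    for c, n in enumerate(counts):
        if n == 0:
            return c  # pull each arm once before using the bound
    return max(range(len(counts)),
               key=lambda c: means[c] + xi * math.sqrt(2.0 * math.log(i) / counts[c]))

def run_ucb(true_means, steps, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    means = [0.0] * len(true_means)
    for i in range(1, steps + 1):
        c = ucb_select(counts, means, i)
        reward = true_means[c] + rng.gauss(0.0, 0.1)  # noisy synthetic reward
        counts[c] += 1
        means[c] += (reward - means[c]) / counts[c]   # running average
    return counts

# Arm 1 has the highest mean reward and should be pulled most often.
counts = run_ucb([0.2, 0.8, 0.5], steps=500)
```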
Gittins Index
UCB is a frequentist (non-Bayesian) strategy: it does not construct any distributional model for how to select the arm. Next we consider a Bayesian approach based upon the Gittins Index, which may also be shown to be no-regret Gittins (1979). It has the additional merit that it exploits the Markovian dependencies between states through the transition matrix structure. Proceeding with its technical development necessitates a distributional model among states. For task $t$, we construct the count-based measure:
$$\hat{P}_t(c' \mid c) = \frac{\sum_{n=1}^{N_t - 1} \mathbb{1}\{ y_n^t = c, \, y_{n+1}^t = c' \}}{\sum_{n=1}^{N_t - 1} \mathbb{1}\{ y_n^t = c \}}. \qquad (11)$$
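The count-based construction in (11) amounts to tallying consecutive label pairs in the ordered training sequence; a sketch on a toy label stream (the data and names are our own):

```python
# Empirical transition matrix between class labels: P[c][c'] is the frequency
# with which label c is followed by label c' in the ordered sequence.
from collections import Counter, defaultdict

def transition_matrix(labels, num_classes):
    counts = defaultdict(Counter)
    for c, c_next in zip(labels, labels[1:]):
        counts[c][c_next] += 1
    P = [[0.0] * num_classes for _ in range(num_classes)]
    for c in range(num_classes):
        total = sum(counts[c].values())
        for c_next in range(num_classes):
            # Fall back to uniform if label c never appears before the end.
            P[c][c_next] = counts[c][c_next] / total if total else 1.0 / num_classes
    return P

# Toy label stream standing in for the labels of one subset T_t.
P = transition_matrix([0, 0, 1, 0, 1, 1, 0, 0], num_classes=2)
```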
This counting-based construction of the transition matrix between classes in $\mathcal{T}_t$ has precedent in Bayesian filtering Krishnamurthy (2016)[Ch. 5]. The Gittins index is then defined as
$$\nu_t(c) = \sup_{\tau > 0} \frac{\mathbb{E}\left[ \sum_{i=0}^{\tau - 1} \gamma^i \, r(c_i) \,\middle|\, c_0 = c \right]}{\mathbb{E}\left[ \sum_{i=0}^{\tau - 1} \gamma^i \,\middle|\, c_0 = c \right]}, \qquad (12)$$
where $\tau$ is a measurable stopping time. Here $\nu_t(c)$ is called the Gittins index associated with reward $r$ at state $c$, and the expectation is computed with respect to the transition distribution over labels (11) for a fixed task $t$. We define the Gittins index identically as in (12) for each meta-training subset $\mathcal{T}_t$.
The Gittins Index Theorem establishes that a selection is optimal, i.e., no-regret (8), if and only if it always selects an arm with the highest Gittins index when label transitions exhibit Markovian dependence Gittins (1979), with (10) as the reward. To investigate whether this condition holds, we use Pearson's chi-squared test to determine whether the evidence supports that the examples are not i.i.d. at the 95% confidence level (significance level of 0.05). Further details and validation of the constructed transition matrices are deferred to Appendix A. In the experimental settings of Sec. 4, there is significant evidence that classes exhibit Markovian dependence.
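Pearson's chi-squared test on a transition-count table can be carried out as follows; the counts are synthetic, and the 0.05 critical value for one degree of freedom (3.841) is hard-coded rather than taken from a statistics library.

```python
# Pearson chi-squared statistic on a label transition-count table: under the
# null hypothesis that consecutive labels are independent, the expected count
# in each cell is (row total * column total) / grand total.
def chi_squared(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = rows[i] * cols[j] / total  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# Strongly diagonal transition counts: each label tends to repeat itself.
stat = chi_squared([[40, 10], [10, 40]])
CRITICAL_0_05_DF1 = 3.841  # chi-squared critical value, df = 1, alpha = 0.05
dependent = stat > CRITICAL_0_05_DF1
```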
Since the reward is approximately constant for each class (state) by (10), we approximate the reward of state $c$ in $\mathcal{T}_t$ as the accuracy obtained by fitting the first sample of label $c$ in $\mathcal{T}_t$ with the initial model. The reward vector of $\mathcal{T}_t$ is then $r_t = (r_t(1), \dots, r_t(C_t))$. We use the largest-remaining-index algorithm Varaiya et al. (1985) to compute the Gittins index of each label in each meta-learning subset (see Appendix B). The Gittins Index Theorem Gittins (1979) states that the optimal action is to choose the bandit with the highest Gittins index at each iteration. Gittins indices are computed offline before the actual training process. The Gittins Index scheduler is shown in Algorithm 3.
3.2 MDPs for Cross-Correlated Task Scheduling
In both UCB and the Gittins index, arms are assumed independent from one another, and correlation across tasks is not permitted. However, in many applications of meta-learning, dependencies across different training subsets exist. In such a setting, the reward for arm $c$ will not remain frozen when arm $c' \neq c$ is chosen. To address this limitation, we consider using MDPs, where transition probabilities and reward functions are defined across subsets (arms) $\mathcal{T}_t$ and $\mathcal{T}_{t'}$. An MDP over state space $\mathcal{S}$ and action space $\mathcal{A}$ is one in which, starting from state $s$ and selecting action $a$, one moves to state $s'$ with probability $P(s' \mid s, a)$. Then, a reward $r(s, a)$ is revealed. The canonical objective of an MDP is to select actions so as to maximize the average cumulative return, or value, defined as $V^\pi(s) = \mathbb{E}\big[ \sum_{i=0}^{H} \gamma^i r(s_i, a_i) \mid s_0 = s \big]$, where $H$ is the horizon length and $\gamma \in (0, 1)$ is a discount factor. It is well known that the optimal value function satisfies Bellman's optimality equation Puterman (2014):
$$V^\star(s) = \max_{a \in \mathcal{A}} \Big[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \, V^\star(s') \Big]. \qquad (13)$$
The optimal policy for each state is the action corresponding to the maximum value:
$$\pi^\star(s) \in \operatorname*{argmax}_{a \in \mathcal{A}} \Big[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \, V^\star(s') \Big]. \qquad (14)$$
The optimal policy is time-homogeneous, i.e., it assigns a fixed action to any state independent of time. One way to obtain the optimal policy in tabular settings, i.e., when the state and action spaces are discrete and of moderate cardinality and the transition matrix is available [cf. (11)], is via linear programming (LP) De Farias and Van Roy (2003).
We proceed to formulate this LP for the meta-learning scheduler policy. The state space is vector-valued, consisting of the $T$-fold Cartesian product of the sets of classes, and the aggregate transition model is the $T$-fold Kronecker product of the task-specific transition matrices (11), i.e., $P = \hat{P}_1 \otimes \cdots \otimes \hat{P}_T$. The Kronecker product ensures dimensional consistency between the state space and the transition model $P$. The action determines which meta-training subset should be chosen at the next training time-slots. Moreover, the reward is given as the scaled validation accuracy (10), as in the beginning of Sec. 3.1, except now we reinterpret the reward as being a function not only of the selected class but also of the meta-learning subset, i.e., $r_t(c)$. This is the additional expressive power of MDPs over the Gittins Index: in MDPs, the reward for the same state changes when different arms are played, which exploits both within- and cross-task correlation. Then, we formulate an LP to solve for the optimal value $V^\star$:
$$\min_{V} \; \sum_{s \in \mathcal{S}} \mu(s) V(s) \quad \text{s.t.} \quad V(s) \ge r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V(s') \ \text{ for all } s \in \mathcal{S}, \, a \in \mathcal{A}, \qquad (15)$$

where $\mu$ is a distribution assigning positive weight to every state.
The optimal policy is computed by equation (14), where $V^\star$ is obtained from the optimal solution of the LP (15). The MDP scheduler is shown in Algorithm 4. With our various active selection schemes defined, we shift to establishing their experimental merits for improving the training of meta-models across a variety of problem contexts.
4 Experiments
We evaluate the proposed MAB/MDP schedulers on three datasets with either explicit or implicit sample dependencies within and across tasks. Across all experiments, we observe significant relative sample-efficiency gains compared to basic cyclic sampling, demonstrating the merit of exploiting covariates in practice.
Digit Recognition
We first evaluate the performance of the schedulers on MNIST handwritten digits LeCun (1998) – MNIST forms the validation set $\mathcal{V}$, and the task-specific subsets are the related Optical Recognition Xu et al. (1992) and Semeion Handwritten Digit datasets Buscema (1998) – see Appendix C for additional details.
For the cross-task (outer) objective, we select the multinomial logistic loss as $g$, and for the task-specific (inner) objective, cross-entropy is selected as the loss $\ell$ Murphy (2012). The specific model is a four-layer fully-connected neural network with 300 nodes per layer, and the hyperparameters $\lambda$ concatenate the inner objective's (the constraint in (5)) learning rate $\alpha$ and the initialization $\theta_0$. We use Adam Kingma and Ba (2014) with a decaying learning rate as the outer-objective optimizer.
To evaluate the performance, we vary the batch size $B$. We compare UCB (Algorithm 2), Gittins Index (Algorithm 3), and cyclic sampling from all subsets, where one simply passes through rows of training data one after another. Results are given in Figure 3. Because there are no strong inner dependencies between examples in the MNIST dataset, the Gittins index algorithm does not exhibit significant gains compared to UCB. However, both active schedulers outperform cyclic sampling: to obtain test accuracy 80%, the Gittins index requires 40 samples, as compared with 53 for UCB sampling and 1300 for cyclic sampling.
Table 2: Test accuracy on the Extreme Weather dataset for each scheduler (rows) and feature (columns).

               UBOT   TMQ    U850   V850   VBOT   Z100
MDP            0.901  0.873  0.917  0.870  0.774  0.842
Gittins Index  0.904  0.836  0.845  0.653  0.738  0.877
UCB            0.673  0.649  0.684  0.421  0.600  0.619
Cyclic         0.352  0.043  0.304  0.480  0.592  0.448
Meta-CIFAR100
The CIFAR-100 dataset is an image dataset containing 100 classes with 600 images each Krizhevsky (2009). We construct 4 task-specific meta-training subsets: each task is associated with a superclass, that is, we form meta-training subsets consisting entirely of a single superclass. This defines a classification problem associated with the classes within it – see Appendix C.
We use cross-entropy as both the inner and outer loss functions and employ a four-layer CNN with strided convolutions and 64 filters per layer. The hyperparameters are the same as in Digit Recognition – see Appendix C.

Figure 4 shows the result of using the Gittins Index and UCB compared with cyclic sampling. Note the significant improvements in sample efficiency and the superior limit point to which the model converges when using active selection as compared with cyclic passes through task-specific samples. Moreover, the Gittins index outperforms UCB, which is evidence that the inherent correlation in the class and task structure is more pronounced in this setting. To achieve 40% accuracy, the Gittins Index scheduler requires 1400 samples, while UCB requires 2000 samples and the cyclic scheduler needs 5000 samples, meaning they are respectively 3.57 and 2.5 times more efficient than cyclic sampling.
Extreme Weather
The Gittins index, as compared to UCB, employs the Markovian transition matrix [cf. (11)] to select the next sample (12), and thus leverages dependencies between classes. In principle, the merit of modeling correlations may be greater when the order of the data has physical meaning. This is not obviously the case for Meta-CIFAR100 and Digit Recognition. To further investigate the merit of exploiting covariates between samples, we focus on an instance arising in meteorology, where the physical meaning of ordering is inherent due to, e.g., the water cycle.
Data Preparation We consider the Extreme Weather dataset Racah et al. (2017): the training data consists of image patterns of various features, and the bounding boxes (prescribed regions) on the images label a specific extreme weather type (considered as the class). We use various bounding boxes with different features to construct the meta-training, validation, and test sets – see Appendix C for details.
Results Our results are summarized in Table 2 and Figure 5. In Appendix D, one may observe that the constructed transition matrices are diagonally dominant, meaning that covariates between neighboring events/classes are more significant. Thus, it is no surprise that in Table 2 the MDP and Gittins index schedulers outperform the other two scheduling policies in all experiments, as they are designed to exploit correlation. In most cases, the MDP outperforms the Gittins Index, showing that cross-task covariates also have a clear positive effect during training, whereas in some cases UCB performs comparably to cyclic sampling.
We also compare our results with Liu et al. (2016), which uses a CNN with hyperparameter optimization to perform binary classification of different weather events using multiple features. We use features similar to those described in Liu et al. (2016), but with a single feature in each test. Although the settings are not directly comparable, we obtain moderate accuracy with a much simpler correlation model. Specifically, with only 5000 five-feature images of size 32×32, which is 90% fewer examples than Liu et al. (2016), we achieve 70-90% of their accuracy. Moreover, we focus on multi-class problems, which are significantly more challenging than binary classification. Thus, the MDP and Gittins Index schedulers can significantly improve training efficiency. See Appendix D for further details.
5 Conclusion
We departed from prior work on meta-learning that presumes independence between tasks by directly modeling within- and across-task correlation. We proposed a module that selects samples according to their contribution to meta-model validation accuracy, which yielded significant sample-efficiency gains across a variety of domains as compared to cyclic passes through data. Rigorously analyzing these sample-efficiency gains is the subject of future work.
References
 Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pp. 1054–1078. Cited by: §3.1.
 Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp. 3981–3989. Cited by: §1, §3.
 Finitetime analysis of the multiarmed bandit problem. Machine learning 47 (23), pp. 235–256. Cited by: §3.1.
 Finitetime analysis of the multiarmed bandit problem. Machine Learning 47 (2), pp. 235–256. External Links: Document, ISBN 15730565, Link Cited by: §1.
 A markovian decision process. Indiana Univ. Math. J. 6, pp. 679–684. External Links: ISSN 00222518 Cited by: §1.
 Optimization methods for largescale machine learning. Siam Review 60 (2), pp. 223–311. Cited by: §3.1.
 Metanet*: the theory of independent judges. Substance use & misuse 33 (2), pp. 439–461. Cited by: Figure 3, §4.
 Weighted metalearning. arXiv preprint arXiv:2003.09465. Cited by: §3.
 System identification: a machine learning perspective. Annual Review of Control, Robotics, and Autonomous Systems 2, pp. 281–304. Cited by: §1.
 Active learning with statistical models. Journal of Artificial Intelligence Research 4, pp. 129–145. Cited by: §1, §2.
 Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, New York, NY, USA, pp. 193–200. External Links: ISBN 9781595937933, Link, Document Cited by: §1.
 The linear programming approach to approximate dynamic programming. Operations research 51 (6), pp. 850–865. Cited by: §1, §3.2.
 Fused dnn: a deep neural network fusion approach to fast and robust pedestrian detection. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. , pp. 953–961. Cited by: §1.
 On the convergence theory of gradientbased modelagnostic metalearning algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1082–1092. Cited by: §2.
 Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1126–1135. Cited by: §1, §1, §2, §3.
 Bilevel programming for hyperparameter optimization and metalearning. ICML 2018. External Links: Document Cited by: §2.
 Bandit processes and dynamic allocation indices [with discussion]. Journal of the Royal Statistical Society. Series B: Methodological 41, pp. 148–177. External Links: Document Cited by: §3.1, §3.1, §3.1.
 Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological) 41 (2), pp. 148–164. Cited by: §3.1.
 Multiarmed bandit allocation indices. John Wiley & Sons. Cited by: §3.1.
 Deep convolutional neural networks with transfer learning for computer visionbased datadriven pavement distress detection. Construction and Building Materials 157, pp. 322 – 330. External Links: ISSN 09500618, Document, Link Cited by: §1.
 Modelagnostic metalearning using rungekutta methods. arXiv preprint arXiv:1910.07368. Cited by: §3.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
 Online learning for characterizing unknown environments in ground robotic vehicle models. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 626–633. Cited by: §1.
 Partially observed Markov decision processes. Cambridge University Press. Cited by: §3.1.
 Learning multiple layers of features from tiny images. Master's thesis, University of Toronto. Cited by: §C.2, §4.
 Learning multiple layers of features from tiny images. University of Toronto. Cited by: §1.
 Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6 (1), pp. 4–22. External Links: ISSN 0196-8858, Link, Document Cited by: §3.1, §3.1.
 Bandit algorithms. Cambridge University Press. Cited by: §1, §3.1.
 Supervised learning and Bayesian classification. External Links: Link Cited by: §1.
 Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
 The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.
 UFO-BLO: unbiased first-order bilevel optimization. arXiv preprint arXiv:2006.03631. Cited by: §2.
 The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning. arXiv preprint arXiv:1907.04472. Cited by: §1.
 Application of deep convolutional neural networks for detecting extreme weather in climate datasets. CoRR abs/1605.01156. External Links: Link, 1605.01156 Cited by: §4.
 Machine learning: a probabilistic perspective. Cited by: §4.

M2SGD: learning to learn important weights. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 957–964. Cited by: §3.
 Numerical optimization. Springer Science & Business Media. Cited by: §3.
 Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: why DNN surpasses GMMs in acoustic modeling. In 2012 8th International Symposium on Chinese Spoken Language Processing, pp. 301–305. Cited by: §1.
 Meta-curvature. In Advances in Neural Information Processing Systems, pp. 3314–3324. Cited by: §3.
 To tune or not to tune? adapting pretrained representations to diverse tasks. ACL 2019, pp. 7. Cited by: §1.
 Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §1, §3.2.
 ExtremeWeather: a large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3405–3416. External Links: Link Cited by: §C.3, §1, §1, §4.

Unsupervised representation learning with deep convolutional generative adversarial networks. External Links: 1511.06434 Cited by: §1.
 Meta-learning with implicit gradients. External Links: 1909.04630 Cited by: §1.
 From theories to queries: active learning in practice. In Active Learning and Experimental Design workshop In conjunction with AISTATS 2010, pp. 1–18. Cited by: §2.

A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 60. Cited by: §1.
 On modulating the gradient for meta-learning. Cited by: §3.
 ES-MAML: simple Hessian-free meta-learning. In International Conference on Learning Representations. Cited by: §3.
 A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning – ICANN 2018, V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis (Eds.), Cham, pp. 270–279. External Links: ISBN 9783030014247 Cited by: §1.
 Extensions of the multi-armed bandit problem: the discounted case. IEEE Transactions on Automatic Control 30 (5), pp. 426–439. Cited by: §3.1.
 Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753. Cited by: §1.
 Large margin meta-learning for few-shot classification. In Neural Information Processing Systems (NIPS) Workshop on Meta-Learning, Montreal, Canada. Cited by: §1.
 Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE transactions on systems, man, and cybernetics 22 (3), pp. 418–435. Cited by: Figure 3, §4.
 Meta-learning without memorization. External Links: 1912.03820 Cited by: §1.

Comparative study of CNN and RNN for natural language processing. External Links: 1702.01923 Cited by: §1.
 One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557. Cited by: §1.
Supplementary Material for
“A Markov Decision Process Approach to Active Meta Learning”
In the supplementary material, we provide additional details regarding the construction of meta-learning tasks and evaluations, the associated data sets, and quantities constructed toward these ends.
Appendix A Determining Sample Dependencies in Meta-training Subsets Using the Chi-squared Test
First, we focus on the statistical validation of the transition matrices constructed as (11) for the various data sets. These transition matrices are essential to constructing the Gittins index (12) and the policy associated with an MDP (15). Our goal here is to determine whether the constructed transition matrices provide evidence that classes and tasks exhibit significant correlation effects.
To do so, we use Pearson's chi-squared test to determine whether there is a statistically significant difference between the expected and observed frequencies at the 95% confidence level, i.e., a p-value threshold of 0.05. The null hypothesis is that samples are i.i.d. within each subset. If the statistical test rejects the null hypothesis, i.e., p-value < 0.05, then Gittins index or MDP-based scheduling is justified. Under independence, the rows of the transition matrix of the constructed Markov chain are identical for a fixed subset. Table 3 shows the p-values of the meta-training subsets in the MNIST and Meta-CIFAR100 experiments. The p-values of the subsets in the Extreme Weather experiment are all nearly 0.
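As a concrete illustration, the test amounts to comparing observed transition counts against the counts expected under independence. The following is a minimal sketch, not the paper's exact pipeline; the 3-class count table is hypothetical.

```python
import numpy as np

def chi_squared_stat(counts):
    """Pearson's chi-squared statistic for an observed transition-count
    table. Under the i.i.d. null hypothesis every row of the transition
    matrix is identical, so expected counts are the outer product of the
    row and column marginals divided by the grand total."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / total
    return float(((counts - expected) ** 2 / expected).sum())

# Hypothetical 3-class transition counts with strong diagonal dominance.
observed = [[50, 5, 5],
            [5, 50, 5],
            [5, 5, 50]]
stat = chi_squared_stat(observed)

# The chi-squared critical value at the 0.05 level with
# (3-1)*(3-1) = 4 degrees of freedom is about 9.488.
reject = stat > 9.488  # rejecting independence justifies the schedulers
```

In practice a library routine such as `scipy.stats.chi2_contingency` computes the same statistic together with the exact p-value.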


This provides substantial evidence across the different data domains that classes and tasks exhibit Markovian dependence, suggesting that exploiting correlation effects may be useful for scheduling.
Appendix B Largest-Remaining-Index Algorithm for the Gittins Index in Meta-Learning
We use the largest-remaining-index algorithm to compute the Gittins index of each state (class) in each meta-learning subset. We elaborate upon how this procedure works next. The first step is to identify the state (class) with the highest Gittins index over the subset's state space:
The next step is a recursion that finds the state with the next-largest Gittins index. Define the continuation set as the states already indexed and the stopping set as the remainder. Then the next state and its associated Gittins index can be computed using a matrix and two vectors, which are shown in detail in Algorithm 5. This procedure is then used in the Gittins-index-based scheduler summarized in Algorithm 3.
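The recursion implemented by Algorithm 5 follows the classical largest-remaining-index scheme for discounted Gittins indices (Varaiya et al., 1985). A minimal sketch is below; the transition matrix, reward vector, and discount factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gittins_indices(P, r, beta):
    """Largest-remaining-index computation of discounted Gittins indices.

    At each step the continuation set C holds the states already indexed.
    For state i, b[i] is the expected discounted reward and c[i] the
    expected discounted time accrued while the chain remains in C; the
    next-largest index is the maximum of b[i] / c[i] over unindexed i.
    """
    n = len(r)
    G = np.zeros(n)
    remaining = set(range(n))
    C = set()  # continuation set: states already assigned an index
    while remaining:
        # Q keeps only transitions that land inside the continuation set.
        Q = np.zeros((n, n))
        for j in C:
            Q[:, j] = P[:, j]
        A = np.eye(n) - beta * Q
        b = np.linalg.solve(A, r)           # expected discounted reward
        c = np.linalg.solve(A, np.ones(n))  # expected discounted time
        best = max(remaining, key=lambda i: b[i] / c[i])
        G[best] = b[best] / c[best]
        C.add(best)
        remaining.remove(best)
    return G

# Illustrative 2-state chain: state 0 carries the higher reward.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
r = np.array([1.0, 0.5])
G = gittins_indices(P, r, beta=0.9)
# G[0] equals r[0]; G[1] exceeds r[1] because state 1 can reach state 0.
```

The first iteration has an empty continuation set, so the indices reduce to the immediate rewards and the top state is simply the reward maximizer, matching the first step of the recursion above.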
Appendix C Additional Details of Experiments
We elaborate upon the meta-learning problem formulation in terms of data preparation and allocation, parameter selection, and loss function specification for the experimental results presented in Section 4. These points are collated into Table 4 for convenience.
Meta-training subsets  Within-task loss  Cross-task loss  Neural net  Hyperparameters
Digit Recognition  Cross-entropy  Multinomial logistic
Meta-CIFAR100  Cross-entropy  Cross-entropy
Extreme Weather  Cross-entropy  Cross-entropy
C.1 Digit Recognition
We construct meta-training subsets with a fixed number of samples per set. Two are drawn from the Semeion dataset, and the other three from the Optical Recognition dataset. We construct a common validation set of size 1400 from the two datasets above to evaluate performance after each hyper-iteration. The performance of this procedure is evaluated on a test set comprised of 60000 samples from the MNIST dataset. The digit images from the Optical Recognition and Semeion datasets differ in size from the MNIST images, so we resize the training and validation images to ensure compatible dimensionality.
C.2 Meta-CIFAR100
The CIFAR-100 dataset is an image dataset containing 100 classes with 600 images each Krizhevsky (2009). There are 500 training images and 100 testing images per class. The 100 classes are grouped into 20 superclasses, each of which contains five classes. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). We construct the task-specific subsets so that each task is associated with a superclass; that is, we form data sets consisting entirely of a single superclass, which defines a classification problem over the classes within it. The superclasses used are "aquatic mammals", "medium-sized mammals", "small mammals", and "insects". We then use the superclass "large carnivores" as the cross-task validation set. We call this construction Meta-CIFAR100.
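The subset construction above reduces to grouping samples by their coarse (superclass) label. A minimal sketch follows; the samples, labels, and superclass ids are toy stand-ins for the real CIFAR-100 loader.

```python
from collections import defaultdict

def build_task_subsets(samples, coarse_labels, task_superclasses):
    """Group samples into task-specific subsets keyed by superclass id.

    samples: iterable of data points; coarse_labels: superclass id per
    sample; task_superclasses: the superclass ids defining the tasks.
    Samples whose superclass is not a task (e.g. the validation
    superclass) are simply skipped.
    """
    subsets = defaultdict(list)
    for x, c in zip(samples, coarse_labels):
        if c in task_superclasses:
            subsets[c].append(x)
    return dict(subsets)

# Toy stand-in: 6 samples over superclasses {0, 1, 2}; tasks use {0, 1}.
samples = ["a", "b", "c", "d", "e", "f"]
coarse = [0, 1, 0, 2, 1, 0]
tasks = build_task_subsets(samples, coarse, task_superclasses={0, 1})
# tasks == {0: ["a", "c", "f"], 1: ["b", "e"]}
```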
C.3 Extreme Weather
We consider the Extreme Weather dataset Racah et al. (2017), where samples from both climate simulations and reanalysis are considered. The reanalysis samples are generated by assimilating observations into a climate model. Ground-truth labeling of various events is obtained via multivariate threshold-based criteria implemented in TECA, together with manual labeling by experts Racah et al. (2017). The training data consist of image patterns, where several relevant spatial variables are stacked together over a prescribed region (called a bounding box) that bounds a type of weather event, which serves as the ground-truth label. The dimensions of the bounding box are based on domain knowledge of events observed in the real world. There are 1460 example images (4 per day, 365 days in the year) arranged in time order for each year's dataset. We use only the dataset for 2005 in this experiment. Each image has 16 channels corresponding to 16 features. Each channel is 768 × 1152, corresponding to one measurement per 25 square km on Earth.
We first build the meta-training subsets. For each image, there are up to 15 bounding boxes, where each box indicates a prescribed region in the image that bounds a type of extreme weather event. We use these bounding boxes to split the dataset into the different subsets of the meta-training set: the first box of each image forms the first subset, the second boxes form the second subset, and so on. Only the first 5 boxes of each image are used, so in total we have 5 different tasks. To better differentiate tasks, each subset uses a different 5 of the 16 features, and the feature sets of the subsets are not identical. The first five bounding boxes form the 5 subsets with 500 images each; another 50 images with all bounding boxes and 5 features are used for validation, and the remaining images with all bounding boxes and only one feature are used for testing. Because the spatial dimensions of climate events vary significantly and the spatial resolution of the source data is nonuniform, the bounding boxes are resized to 32 × 32.
Appendix D Additional Results of the Extreme Weather Experiment
We present a sample transition matrix of a task-specific data subset, constructed via (11), below:
The transition matrix is diagonally dominant, which means that the examples in the dataset are highly correlated: after one type of extreme weather event occurs, the same type or a neighboring type is likely to follow. Combining this likelihood structure with the reward vectors obtained (the initial validation accuracies), the Gittins index reflects the relative "importance" of each state in each arm during training. Following the Gittins index policy, we can find the optimal stopping time on one meta-training set and the next dataset the ML model should learn.
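A common way to estimate such a transition matrix is to count consecutive class-label pairs along the ordered data and row-normalize; the sketch below follows that recipe (it is not necessarily the paper's exact estimator (11)), with an illustrative label sequence rather than the weather data itself.

```python
import numpy as np

def empirical_transition_matrix(labels, n_classes):
    """Estimate a row-stochastic transition matrix by counting consecutive
    label pairs and normalizing each row. A class that never appears as a
    source gets a uniform row so the matrix stays stochastic."""
    counts = np.zeros((n_classes, n_classes))
    for a, b in zip(labels[:-1], labels[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    P = np.where(row_sums > 0,
                 counts / np.maximum(row_sums, 1),
                 1.0 / n_classes)
    return P

# Illustrative label sequence with frequent self-transitions.
labels = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2]
P = empirical_transition_matrix(labels, n_classes=3)
# Rows sum to 1; self-transitions dominate, mirroring the diagonal
# dominance observed in the weather transition matrix.
```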
Table 5 summarizes the examples used in each meta-training subset to train the ML model under the different schedulers, using feature U850 in the test set. Observe that for the MDP and Gittins index schedulers, each meta-training subset contributes to training on different types of weather events, while training set 4 is rarely scheduled, indicating that it contributes little to validation performance for any type of event. This filtering out of irrelevant information makes training the meta-learner more efficient. The overall classification accuracy for each weather type at the end of training is summarized in Table 6. Since the schedulers select more samples labeled Tropical Cyclone and Extratropical Cyclone, the classification accuracy on these weather types is generally higher.



Scheduler  Trop. Depression  Trop. Cyclone  Extratropical Cyclone  Atmo. River
MDP  0.789  0.961  0.947  0.658
Gittins Index  0.421  0.836  0.963  0.395
UCB  0.368  0.698  0.788  0.421