A Markov Decision Process Approach to Active Meta Learning

by Bingjia Wang, et al.
Cornell University

In supervised learning, we fit a single statistical model to a given data set, assuming that the data is associated with a singular task, which yields well-tuned models for specific use, but does not adapt well to new contexts. By contrast, in meta-learning, the data is associated with numerous tasks, and we seek a model that may perform well on all tasks simultaneously, in pursuit of greater generalization. One challenge in meta-learning is how to exploit relationships between tasks and classes, which is overlooked by commonly used random or cyclic passes through data. In this work, we propose actively selecting samples on which to train by discerning covariates inside and between meta-training sets. Specifically, we cast the problem of selecting a sample from a number of meta-training sets as either a multi-armed bandit or a Markov Decision Process (MDP), depending on how one encapsulates correlation across tasks. We develop scheduling schemes based on Upper Confidence Bound (UCB), Gittins Index and tabular Markov Decision Problems (MDPs) solved with linear programming, where the reward is the scaled statistical accuracy to ensure it is a time-invariant function of state and action. Across a variety of experimental contexts, we observe significant reductions in the sample complexity of the active selection schemes relative to cyclic or i.i.d. sampling, demonstrating the merit of exploiting covariates in practice.


1 Introduction

In supervised learning, we learn to map features to targets by minimizing a statistical loss averaged over samples from an unknown distribution which is typically associated with a singular task Learned-Miller (2011). When this map is a universal function approximator, i.e., a deep neural network (DNN), this framework has yielded successes across a variety of applications Yin et al. (2017); Gopalakrishnan et al. (2017); Du et al. (2017); Pan et al. (2012). However, its successes have been limited when data is comprised of several qualitatively different regimes, or tasks. To enhance adaptivity to disparate tasks, meta-learning seeks to obtain model parameters along the Pareto frontier of the minimizer of many training objectives simultaneously Andrychowicz et al. (2016), and has gained attention for overcoming data starvation issues in robotics and physical systems Finn et al. (2017).

Existing approaches, however, offer little guidance about how to select samples on which to train to enable fast convergence, and instead operate via cyclic or random sampling. Doing so is appropriate when disparate tasks are statistically independent. However, in many contexts such as meteorology Racah et al. (2017), computer vision, and robotics Finn et al. (2017), significant relationships between tasks exist. We are then faced with the question of how to incorporate such relationships into the training of a meta-model. In this work, we do so via active sample selection during the training of meta-models. This active sample selection is executed according to correlation within and across tasks via multi-armed bandits (MAB) Lattimore and Szepesvári (2020) and Markov Decision Processes (MDPs) Puterman (2014) based schedulers, which yields substantial gains in sample efficiency across a variety of experimental settings.

Before continuing, a few historical remarks are in order. Augmenting DNN training to improve adaptivity has received substantial interest over the years. Transfer learning relaxes the independent and identically distributed (i.i.d.) hypothesis on data, and seeks to transform a model good for one task to another (domain adaptation) Tan et al. (2018); Dai et al. (2007), e.g., transferring an understanding of Spanish to Italian Dai et al. (2007). Generative modeling, by contrast, directly estimates the data distribution in order to output new examples that plausibly could have been drawn from the original data, similar in spirit to bootstrapping. Recent advances in parameterizing these models using deep neural networks have enabled scalable modeling of complex, high-dimensional data Shorten and Khoshgoftaar (2019). Both approaches are effective for transferring from one task to another, but it is unclear how to employ these approaches when seeking generalization across many tasks, unless the generative/covariance model co-evolves with data drift, which may cause instability Radford et al. (2015).

By contrast, meta-learning seeks to learn attributes of a problem class which are common to many distinct domains, and has been observed to improve adaptability via explicitly optimizing their few-shot generalization across a set of meta-training tasks Wang et al. (2019). Importantly, doing so enables learning of a new task with as little as a single example Yu et al. (2018); Yin et al. (2019). Meta-learning algorithms can be framed in terms of a cost that ties together many training sub-tasks simultaneously, with, for instance, recurrent or attention-based models, or an otherwise two-stage objective Liu and Vicente (2019): the inner cost defines performance on a single task, and the outer meta-objective tethers performance across tasks. Doing so results in procedures that experimentally have yielded substantial gains in terms of DNN adaptation and generalization to new tasks Rajeswaran et al. (2019).

Figure 1: Our scheduler selects which samples from training subsets to execute task-specific updates to ensure the meta-model’s performance improves as rapidly as possible as quantified by meta-training subsets’ contribution to the meta-model’s validation accuracy. Doing so requires a novel definition of the reward in multi-armed bandits or MDPs.

The aforementioned works, as well as other meta-learning objectives, operate under the assumption that training samples are i.i.d. to justify sampling cyclically or randomly. This assumption is invalid for settings involving drift or latent relationships between classes, such as training an NLP system for both Spanish and Italian Peters et al. (2019), image classification of animals from a common genus Wang et al. (2018), or systems identification problems arising in ground robotics when traversing prairie and forest floor Koppel et al. (2016); Chiuso and Pillonetto (2019). Thus, in this work, we propose to build a scheduler on top of the meta-learner (Figure 1) to exploit relationships between meta-training data subsets to allocate samples judiciously.

To do so, we incorporate ideas from active learning Cohn et al. (1996), specifically, selecting a given meta-learning training subset, according to either a multi-armed bandit Auer et al. (2002b) or a Markov decision process (MDP) Bellman (1957). Which technique is appropriate depends on whether the statistical accuracy of one task is allowed to be correlated with another. In either case, the state is the weights of a meta-learning model, the arm (action) is the index of the specific training task or class label, and the reward is the statistical accuracy of the meta-model on a validation set multiplied by a scaling factor to ensure the reward is stationary. Moreover, regret of a given arm is the scaled average long-run validation accuracy on that meta-training subset.

Experimentally, we observe the merit of bandit selections when we employ the Upper Confidence Bound (UCB) or Gittins Index, and MDP policies based upon a linear programming solver De Farias and Van Roy (2003) for meta-training DNNs. In particular, we obtain orders of magnitude improvement in sample complexity when employing our sample selection schemes relative to cyclic or random sampling (Table 1) for training feedforward multi-layer DNNs and convolutional variants on MNIST Lecun et al. (1998), the real-world Extreme Weather dataset Racah et al. (2017), and a meta-learning variant of CIFAR100 Krizhevsky (2012). On top of sample efficiency gains, the order of sample selection experimentally can fundamentally improve the limit points to which the meta-model converges.

Experiment          UCB Scheduler   Gittins Index Scheduler   MDP Scheduler
Digit Recognition   24.5            32.5                      /
Meta CIFAR-100      2.5             3.57                      /
Extreme Weather     1.25            2.42                      3.33
Table 1: Relative sample efficiency gain compared to baseline cyclic sampling on different experiments.

2 Elements of Meta-Learning

In supervised learning, we seek to build a predictor $\hat{y}_\theta$ which maps feature vectors $x \in \mathcal{X}$ to target variables $y \in \mathcal{Y}$ by minimizing a loss function $\ell$ in expectation over the data distribution, which is unknown. Here $\theta$ denotes the parameters of the statistical model (such as a feedforward or convolutional neural network). The loss $\ell(\hat{y}_\theta(x), y)$ quantifies the difference between the candidate prediction $\hat{y}_\theta(x)$ at an input vector $x$ and a target variable $y$, and is small when $\hat{y}_\theta(x)$ and $y$ are close. For concreteness and clarity, we focus on the case of multi-class classification, an instance of supervised learning, although the ideas developed in this work are also applicable to unsupervised and reinforcement learning. Thus, the space of target variables is of the form $\mathcal{Y} = \{1, \dots, C\}$, where $C$ is the number of classes. In this context, we wish to compute the parameters that minimize the statistical loss over $\theta$,

$$\theta^\star \in \arg\min_{\theta} \; \mathbb{E}_{x,y}\big[\ell(\hat{y}_\theta(x), y)\big], \tag{1}$$

where the expectation is over $(x, y)$. In practice, one is given a batch of data $\mathcal{D}$, which may be associated with any number of unknown distributions colloquially referred to as tasks. In particular, we have access to $J$ distinct training subsets $\mathcal{D}_1, \dots, \mathcal{D}_J$ whose union is $\mathcal{D}$, and we would like to find a model that simultaneously performs well on each:


We consider that each meta-learning sample subset is split into a training and a validation set, i.e., with , and that the training subsets for all are used for training within tasks, whereas the validation set is used across tasks. Moreover, we denote and . (Footnote 1: For disambiguation, we denote samples of as for . Moreover, we denote as the number of training examples available for task . Throughout, to further alleviate notation, we suppress the dependence of example on class , and instead leave this dependence implicit.) Then, we hypothesize that the statistical model

depends on a vector of hyperparameters

, such as the regularizer, the radius of a pooling step in a convolutional neural network, or other architectural considerations. One way to pose the problem of meta-learning is as a two-stage optimization variant of (1):


where is again some cost, possibly equal to , which is small when and are close. This formulation yields models which both perform well on individual tasks as quantified by and across tasks through seeking to minimize for all simultaneously. That is, model selection of according to (2) at the inner-stage (the constraint evaluation) is decoupled across tasks, whereas at the outer stage, the objective is coupled by hyperparameters . For connections to bilevel optimization, see Franceschi et al. (2018); Likhosherstov et al. (2020).

Given that computing the simultaneous minimizer of a number of different non-convex functions is intractable, one may hypothesize that the universal quantifier over task in (2) may be replaced by the sum-costs


which presupposes that tasks and classes are statistically independent. Then, because exactly solving the inner optimization problem, i.e., the constraint in (2), is both intractable numerically when is a neural network (as the problem becomes non-convex) and may lead to solutions that over-prioritize a singular task (overfit), one may consider the computational approximation of (2) as in Finn et al. (2017)


Note that the $\arg\min$ in the constraint of (2) has been substituted in (4) by the fact that we seek model parameters close to the fixed point of the gradient of the task-specific objective Finn et al. (2017), while also minimizing the cost which is defined across tasks. The spirit of (4) is that we seek model parameters that perform well after a few gradient steps on an unseen task, whereas (1) yields solutions that perform well on average observing a number of samples from a common distribution. Prevailing practice in meta-learning is built upon assuming statistical independence between tasks and classes, i.e., writing , which permits grouping the inner and outer expectations – see Fallah et al. (2020).
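For orientation, the one-step objective of Finn et al. (2017) is commonly written as below. This is a sketch in assumed notation, and the paper's own (4) may differ in its details: $f_j$ stands for the expected loss on task $j$, $J$ for the number of tasks, and $\alpha$ for the inner step size.

```latex
\min_{\theta} \; \sum_{j=1}^{J} f_j\big(\theta - \alpha \nabla f_j(\theta)\big)
```

The argument $\theta - \alpha \nabla f_j(\theta)$ is the model after one task-specific gradient step, so the objective rewards parameters that adapt well rather than parameters that are already optimal for any single task.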

Main Results In this work, we move beyond the hypothesis that tasks and classes are independent by considering a generalization of (4): rather than focusing on the aggregate task-specific cost , we retain the task-specific model fitness in the constraint ,


which instead reveals the question of how to compute a point at the intersection of a set of constraints for each of classes when the satisfaction of one constraint influences another. In this work, we focus on sequential approaches to addressing this question, inspired by active learning Cohn et al. (1996); Settles (2011). In particular, we develop techniques to select which among the different tasks and different classes one should execute a training step at any given time such that the overall meta-learning performance is optimized expeditiously. Doing so yields significant gains in sample efficiency of training meta-learners across a variety of experimental contexts, as we demonstrate in Sec. 4 – see Table 1. Next, we shift to the technical development of bandits and MDPs to this end.

Initialize: No. tasks , task-specific data , , validation set , init. params. associated w/ hyperparams. , batch size
for  do
       for  do
             Schedule mini-batch
             Update parameters via SGD [cf. (6)]
       end for
      Update hyperparams. of meta-model [cf. (7)]
end for
return Meta-model params. , hyperparams.
Algorithm 1 Active Learning for Meta Learning
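The control flow of Algorithm 1 can be sketched as a loop with a pluggable scheduler. This is a minimal illustration, not the authors' implementation: the function `active_meta_train`, its callback signatures, and the scalar toy problem are all hypothetical, chosen only to show where the scheduler, the within-task update, and the meta-model update sit.

```python
def active_meta_train(tasks, schedule, inner_step, outer_step,
                      theta, lam, outer_iters, inner_iters):
    """Skeleton of an actively scheduled meta-training loop (hypothetical API).

    tasks      : list of per-task training data
    schedule   : callable(t, history) -> index of the task to train on next
    inner_step : callable(theta, lam, task_data) -> (new theta, reward)
    outer_step : callable(theta, lam) -> new lam (meta-model update)
    """
    history = []  # (task index, reward) pairs, available to the scheduler
    for _ in range(outer_iters):
        for t in range(inner_iters):
            j = schedule(t, history)            # replaces cyclic/random order
            theta, reward = inner_step(theta, lam, tasks[j])
            history.append((j, reward))
        lam = outer_step(theta, lam)            # one meta-update per outer pass
    return theta, lam, history


# Toy usage: each "task" is a scalar target, lam plays the role of a step size.
targets = [1.0, 3.0]

def cyclic(t, history):
    """Baseline scheduler: pass through the tasks in order."""
    return t % len(targets)

def inner(theta, lam, target):
    """Gradient step on 0.5 * (theta - target)**2; reward is negative error."""
    new_theta = theta - lam * (theta - target)
    return new_theta, -abs(new_theta - target)

def outer(theta, lam):
    """Placeholder meta-update: simply decay the step size."""
    return 0.9 * lam

theta, lam, history = active_meta_train(
    targets, cyclic, inner, outer, theta=0.0, lam=0.5,
    outer_iters=3, inner_iters=4)
```

Swapping `cyclic` for a bandit or MDP scheduler changes only the `schedule` argument; the two update steps are untouched.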

3 Active Sample Selection

In meta-learning (5), there are two intertwined challenges. First, to enforce the constraint, one requires access to training examples for each task and class in order to evaluate the gradient of the different task-specific objectives with respect to model parameters for fixed hyperparameters . With access to for each task, a stochastic gradient update with step-size is performed:


where is some mini-batch size, which makes (6) a stochastic gradient step (for ), and we have suppressed dependence on for succinctness. Existing approaches proceed to execute training steps on all tasks and classes cyclically, meaning there are total updates of the form (6) – see Andrychowicz et al. (2016); Finn et al. (2017). Then, we conduct a stochastic gradient update of step-size with respect to the meta-model:


For simplicity, we consider that samples are chosen from validation set to execute a meta-model update in (7).

One way of going beyond statistical independence between tasks in the updates is by using second-order information Im et al. (2019); Song et al. (2019); Park and Oliva (2019); however, when computing the Hessian of the Lagrangian of (5), its statistical properties are only locally (not globally) informative due to non-convexity – see Nocedal and Wright (2006). Instead, we directly exploit covariates within and between tasks. While related ideas have been proposed for how to weight the gradient of the meta-objective in Cai et al. (2020); Simon et al. (2020); Nicholas et al. (2020), none have augmented the update rule both within a task and across tasks.

To do so, we estimate dependencies both within each task and dependencies across different tasks as respectively a multi-armed bandit (MAB) or a Markov Decision Problem (MDP). Before proceeding to defining their specific use in modeling dependencies to more effectively schedule which task one should perform an inner-loop update at a given time, we present the generic procedure for concreteness as Algorithm 1, which is depicted graphically in Figure 1. It involves a MAB/MDP scheduler followed by the within-task and cross-task SGD optimization. Next, we define in detail the Scheduler called in Algorithm 1.

Figure 2: Scaled on MNIST is nearly constant for each class (state) as a function of within-task training index . Thus, via the approximate relationship between the rate of attenuation of the expected gradient of the meta-training objective and validation error during within-task training, we can define a reward which is time-invariant, and hence satisfies the conditions required for a valid bandit formulation in the sense that the distribution in (8) is stationary.

3.1 Multi-armed Bandits Scheduling of Subsets

Multi-armed bandits (MAB) encapsulate the setting where we seek to exploit covariates within a task, e.g., how one class is correlated with another. In MAB, at each time , a player (scheduler) selects one among available arms, denoted as (subsequently we abbreviate ), after which a reward is revealed Lattimore and Szepesvári (2020). Since rewards are observed sequentially, under the setting that the underlying generating process of the rewards is stationary, the optimal selection is the one that performs best-in-hindsight, i.e., . The performance of any sequential selection strategy for may be quantified as the expected sub-optimality, or regret , defined as,


Strategies whose time-average regret approaches zero as the time horizon becomes large are called no-regret. We consider two widely-used MAB no-regret algorithms, the Upper-Confidence Bound (UCB) Lai and Robbins (1985); Agrawal (1995); Auer et al. (2002a) and Gittins Indices Gittins (1979); Gittins et al. (2011), due to both their simplicity and that they operate upon fairly different principles. Before shifting to describing how is selected for these algorithms, we identify how the structural attributes of MABs are well-suited to active sampling for meta-models.
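For reference, best-in-hindsight regret is often written as follows. The notation is assumed for illustration and may differ from the paper's (8): $K$ is the number of arms, $a_t$ the arm selected at time $t$, $r_t(a)$ the reward of arm $a$ at time $t$, and $T$ the horizon.

```latex
a^\star = \arg\max_{a \in \{1,\dots,K\}} \sum_{t=1}^{T} \mathbb{E}\big[r_t(a)\big],
\qquad
\mathrm{Reg}_T = \sum_{t=1}^{T} \mathbb{E}\big[r_t(a^\star)\big]
               - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big]
```

Under this convention a strategy is no-regret when $\mathrm{Reg}_T / T \to 0$ as $T \to \infty$, which is implied by the logarithmic regret bounds known for UCB under stationary rewards.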

Result: Batch
Input: Time index ;
Upper Bound ;
Exploration factor ;
: number of visits to subset until time t;
Use initial model to train on each with first batch of samples independently to obtain ;
At time :
Algorithm 2 UCB Scheduler

In meta-learning, for multi-class classification with classes for task , the different possible arms are the classes, i.e., , and the arm pulled at a given time is the class , meaning that one executes a SGD step (6) associated with class . An open question is then how to define the reward . One possibility is the statistical accuracy on the validation set :


where the indicator is 1 when the model classifies training example correctly and 0 otherwise. Observe, however, that as the model and hyperparameters evolve during training, the reward will drift as the validation accuracy improves, which invalidates the stationarity hypothesis (that the distribution in (8) is stationary) underlying the guarantees of UCB and Gittins indices.

To ameliorate this issue, we use the fact that SGD and its first-order variants (such as Adam) on non-convex problems exhibit a convergence rate to a first-order stationary point in terms of attenuation of the gradient norm Bottou et al. (2018)[Sec. 4.3]. Then, based upon the hypothesis that the rates of attenuation of the gradient norm and the statistical error are comparable, should be constant during training. Thus, we define the reward as


Figure 2 shows the errors of some classes in a sample meta-training subset over the first training steps in our MNIST experiment (elaborated upon in Section 4). Observe that of each state is approximately a constant over time, which provides evidence to support our hypothesis, and thus substantiates our choice of reward for linking class selection among performance on training subsets with the meta-learning validation objective [cf. (5)]. The values of may increase for larger since the model parameters may settle to the local minima and the error saturates. This is not a problem, however, as later selections influence regret less due to the accumulating sum over time in regret (8). This decrease in importance of later decisions may further be enforced through discounting that arises in UCB, Gittins Indices, and MDPs as described next.

Result: Batch
Input: Time index ;
Initialize: Compute Gittins Indices of using Algorithm 5 in Appendix B
At time :
Algorithm 3 Gittins Index Scheduler

Upper Confidence Bound

Upper Confidence Bound (UCB) operates upon the principle of optimism in the face of uncertainty. Specifically, we initialize the model associated with task via a single iteration of (6) on . Then, we count the number of times has been chosen at time as for each , i.e., and its associated average reward:

Then, UCB selection operates via calibrated perturbation from the sample mean of the reward as

where and are constants that encourage exploration. This procedure is repeated for total steps, and achieves regret that is logarithmic in the total number of steps , which is precisely the within-task mini-batch size – see Lai and Robbins (1985). We set the exploration factor . For each hyperparameter update of , a batch of samples is selected from according to those classes from which maximize the upper-confidence bound as determined by Algorithm 2. Then, these samples are used to update the hyperparameters w.r.t. the validation loss in (7).
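The selection rule described above can be illustrated with the textbook UCB1 index, mean reward plus an exploration bonus. This is a sketch under assumptions: the bonus form `sqrt(c * ln(t) / n)`, the default `c = 2.0`, and the helper names `ucb_select` and `ucb_update` are illustrative and not taken from the paper, whose constants are not reproduced here.

```python
import math

def ucb_select(counts, means, t, c=2.0):
    """Return the arm maximizing mean + sqrt(c * ln(t) / count).

    counts[k] : pulls of arm k so far (each >= 1 after initialization)
    means[k]  : running average reward of arm k
    t         : current time step, t >= 1
    c         : exploration factor (assumed value)
    """
    scores = [m + math.sqrt(c * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def ucb_update(counts, means, arm, reward):
    """Incrementally update the pull count and average reward of `arm`."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]

# Toy run: three classes with fixed (stationary) rewards, one initial pull each.
true_rewards = [0.2, 0.5, 0.9]
counts = [1, 1, 1]
means = list(true_rewards)
for t in range(4, 204):                      # 200 scheduling decisions
    arm = ucb_select(counts, means, t)
    ucb_update(counts, means, arm, true_rewards[arm])
```

In this toy run the best arm ends up with the majority of the pulls, while the other two are still revisited occasionally because their bonus grows with `ln(t)`.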

Gittins Index

UCB is a frequentist (non-Bayesian) strategy: it does not construct any distributional model for how to select . Next we consider a Bayesian approach based upon Gittins Index, which may also be shown to be no regret Gittens and Dempster (1979). It has the additional merit that it exploits the Markovian dependencies between states by the transition matrix structure. Proceeding with its technical development necessitates a distributional model among states. For task , we construct the count-based measure:


This counting-based construction of the transition matrix between classes in has precedent in Bayesian filtering Krishnamurthy (2016)[Ch. 5]. Gittins index is then defined as


where is a measurable stopping time. Here is called Gittins index associated with reward at state , and the expectation is computed with respect to the distribution over labels for a fixed . We define the Gittins index identically as (12) for each meta-training subset as .
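A count-based transition estimate of the kind referred to in (11) can be sketched by tallying consecutive label pairs and normalizing each row. The function name, the integer label encoding, and the uniform fallback for unseen classes are assumptions made for this illustration.

```python
def transition_matrix(labels, num_classes):
    """Estimate P[a][b] = Pr(next label = b | current label = a) by counting."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(labels, labels[1:]):     # consecutive label pairs
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Class never observed as a source state: fall back to uniform
            # (an assumption; other smoothing choices are possible).
            matrix.append([1.0 / num_classes] * num_classes)
        else:
            matrix.append([c / total for c in row])
    return matrix

# Label sequence of one meta-training subset, in the order the data is stored.
P = transition_matrix([0, 0, 1, 2, 1, 0, 0, 1], num_classes=3)
```

Each row of `P` is a probability distribution over the next class, so the matrix is row-stochastic by construction.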

Result: Batch
Input: Time index ;
Initialize: Compute Value vectors solving LP (15)
At time :
Algorithm 4 MDP Scheduler

The Gittins Index Theorem establishes that a selection is optimal, i.e., no regret (8), if and only if it always selects an arm with highest Gittins index when there is Markovian dependence on the way label transitions occur Gittens and Dempster (1979), with (10) as the reward. To investigate whether this condition holds true, we use Pearson’s chi-squared test to determine whether the evidence supports that the examples are not i.i.d. at the 95% confidence level (significance level of 0.05). Further details and validation of the constructed transition matrices are deferred to Appendix A. In the experimental settings of Sec 4, there is significant evidence that classes exhibit Markovian dependence.
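One way such a test could be set up is to tabulate (current label, next label) pairs and compute Pearson's statistic against the independence hypothesis. The sketch below is an assumption about the setup, not the procedure of Appendix A; it treats consecutive pairs as the observations, and it hard-codes the 0.05 critical value for one degree of freedom (3.841), which applies only to the two-class toy sequence used here.

```python
def chi2_independence(labels, num_classes):
    """Pearson chi-squared statistic for independence of consecutive labels.

    Returns (statistic, degrees of freedom).
    """
    k = num_classes
    obs = [[0] * k for _ in range(k)]
    for a, b in zip(labels, labels[1:]):
        obs[a][b] += 1
    n = len(labels) - 1                       # number of observed pairs
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    stat = 0.0
    for i in range(k):
        for j in range(k):
            expected = row[i] * col[j] / n    # count expected under independence
            if expected > 0:
                stat += (obs[i][j] - expected) ** 2 / expected
    return stat, (k - 1) ** 2

# Toy sequence with long runs of the same class, i.e. strongly ordered data.
sequence = [0] * 10 + [1] * 10 + [0] * 10 + [1] * 10
stat, dof = chi2_independence(sequence, num_classes=2)
CRITICAL_1DOF_005 = 3.841                     # chi-squared 0.95 quantile, 1 dof
reject_iid = stat > CRITICAL_1DOF_005
```

For more than two classes the degrees of freedom grow as `(k - 1) ** 2` and the critical value must be looked up accordingly, for instance from a statistics library.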

Since the reward is a constant for each class (state), based on equation (10), we approximate the reward of state in as the accuracy of fitting the first sample of label in into the initial model. The reward vector of is then . We use the largest-remaining-index algorithm Varaiya et al. (1985) to compute the Gittins Index of each label in each meta-learning subset (see Appendix B). The Gittins Index Theory Gittens and Dempster (1979) states that the optimal action is to choose the bandit with the highest Gittins Index at each iteration. Gittins indices are computed offline before the actual training process. The Gittins Index scheduler is shown in Algorithm 3.

3.2 MDPs for Cross-Correlated Task Scheduling

In MAB, arms are assumed independent from one another in UCB and Gittins index and correlation across tasks is not permitted. However, in many applications of meta-learning, dependencies across different training subsets exist. In such a setting, the reward for arm will not remain frozen when arm is chosen. To address this limitation, we consider using MDPs, where transition probabilities and reward functions are defined across subsets (arms) and .

An MDP over state space and action space is one in which, starting from state , and selecting action , one moves to state with probability . Then, a reward is revealed. The canonical objective of an MDP is to select actions so as to maximize the average cumulative return, or value, defined as , where is the horizon length and is a discount factor. It’s well-known that the optimal value function satisfies Bellman’s optimality equation Puterman (2014):

Figure 3: Digit recognition experiment. Cyclically processing samples from task-specific subsets comprised of Optical Recognition Xu et al. (1992) and Semeion Handwritten Digits Buscema (1998) yields much higher sample complexity for obtaining a well-performing model on unseen MNIST data as compared to bandit schedulers: well-performing models via bandit scheduling require only 200 steps, nearly an order of magnitude reduction.

The optimal policy for each state is the action corresponding to the maximum value:


The optimal policy is time-homogeneous, i.e., assigns a fixed action to any state independent of time for . One way to obtain the optimal policy for tabular settings, i.e., when the state and action spaces are discrete and of moderate cardinality, when the transition matrix is available [cf. (11)] is via linear programming (LP) De Farias and Van Roy (2003).
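For reference, the textbook discounted forms of the objects discussed here are sketched below. The notation is assumed ($V$ for the value function, $r(s,a)$ for the reward, $P(s' \mid s, a)$ for the transition probability, $\gamma \in (0,1)$ for the discount factor) and may not match the paper's (13)-(15) symbol for symbol.

```latex
% Bellman optimality equation
V^\star(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\star(s') \Big]

% Greedy (optimal) policy
\pi^\star(s) \in \arg\max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\star(s') \Big]

% Linear program whose optimal solution is V^\star
\min_{v} \; \sum_{s} v(s)
\quad \text{s.t.} \quad
v(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s') \quad \forall\, s, a
```

The LP has one variable per state and one constraint per state-action pair, which is why it is practical only for tabular problems of moderate size.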

Figure 4: Meta-CIFAR-100 experiment. CIFAR-100 is divided into task-specific datasets by superclasses "aquatic mammals", "medium-sized mammals", "small mammals" and "insect." Then, we use the superclass "large carnivores" as the cross-task test set. The performance gap between cyclic and active sampling is more stark for this setting, as the inherent correlation is more pronounced. Gittins Index scheduler achieves 73% accuracy and UCB achieves 58% accuracy, while cyclic sampling only has 40% accuracy.

We proceed to formulate this LP for the meta-learning scheduler policy. The state space is vector-valued consisting of the -fold Cartesian product of the set of classes , the aggregate transition model is the -fold Kronecker product of task-specific transition matrix (11), i.e., . The Kronecker product ensures dimensional consistency between the state space and the transition model . The action determines which meta-training subset should be chosen at the next training time slot. Moreover, the reward is given as the validation accuracy (10), as in the beginning of Sec. 3.1, except now we reinterpret the reward as being not only a function of the selected class but also the meta-learning subset as well, i.e., . This is the additional expressive power of MDPs over Gittins Index. In MDPs, the reward for the same state changes when different arms are played, which exploits both within and cross-task correlation. Then, we formulate an LP to solve for the optimal value :


The optimal policy is computed by equation (14), where is obtained from the optimal solution in LP (15). The MDP scheduler is shown in Algorithm 4. With our various active selection schemes defined, we shift to establishing their experimental merits for improving the training of meta-models across a variety of problem contexts.
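The Kronecker construction of the aggregate transition model can be made concrete with two small task-specific matrices. The helper `kron` and the numeric entries are illustrative assumptions; the point is only that the product of row-stochastic matrices is again row-stochastic over the product state space.

```python
from functools import reduce

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

# Hypothetical two-class transition matrices for two tasks.
P1 = [[0.9, 0.1],
      [0.2, 0.8]]
P2 = [[0.5, 0.5],
      [0.3, 0.7]]

# Joint state (c1, c2) is indexed as c1 * len(P2) + c2.
joint = reduce(kron, [P1, P2])
row_sums = [sum(row) for row in joint]
```

With J tasks of C classes each, the joint matrix has C**J rows, so the state space grows exponentially in the number of tasks; this is the cost of modeling cross-task dependence in a tabular MDP.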

4 Experiment

We evaluate the proposed MAB/MDP schedulers on three datasets with either explicit or implicit sample dependencies within and across tasks. Across all experiments, we observe significant relative sample efficiency gains compared to basic cyclic sampling, demonstrating the merit of exploiting covariates in practice.

Digit Recognition

We first evaluate the performance of the schedulers on MNIST handwritten digits LeCun (1998) – MNIST forms the validation set , and the task-specific subsets are the related Optical Recognition Xu et al. (1992) and Semeion Handwritten Digit data sets Buscema (1998) – see Appendix C for additional details.

In cross-task , we select multinomial logistic as the loss , and in task specific , cross-entropy is selected as loss Murphy (2012). The specific model is a four-layer fully-connected neural network with 300 nodes per layer, and the hyperparameters concatenate the inner objective’s (the constraint in (5)) learning rate and the initialization . We use Adam Kingma and Ba (2014) with decaying learning rate as outer objective optimizer.

To evaluate the performance, we vary the batch size . We compare UCB (Algorithm 2), Gittins Index (Algorithm 3), and cyclic sampling from all subsets, where one simply passes through rows of training data one after another. Results are given in Figure 3. Because there are no strong inner dependencies between examples in MNIST dataset, Gittins index algorithm does not exhibit significant gains compared to UCB. However, both active schedulers outperform the cyclic sampling: to obtain test accuracy 80%, Gittins index requires 40 samples as compared with 53 for UCB sampling and 1300 for cyclic from test data.
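The sample counts quoted above can be turned into relative efficiency gains by dividing the baseline count by each scheduler's count. Reading the Digit Recognition row of Table 1 as exactly this ratio is an assumption, although the numbers agree; the function name is illustrative.

```python
def efficiency_gain(baseline_samples, scheduler_samples):
    """Ratio of samples needed by the baseline to samples needed by a scheduler."""
    return baseline_samples / scheduler_samples

# Samples needed to reach 80% test accuracy, as reported in the text.
cyclic, ucb, gittins = 1300, 53, 40
gains = {
    "UCB": round(efficiency_gain(cyclic, ucb), 1),
    "Gittins Index": round(efficiency_gain(cyclic, gittins), 1),
}
```

The resulting values, 24.5 for UCB and 32.5 for Gittins Index, match the Digit Recognition entries of Table 1.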

Scheduler       UBOT    TMQ     U850    V850    VBOT    Z100
MDP             0.901   0.873   0.917   0.870   0.774   0.842
Gittins Index   0.904   0.836   0.845   0.653   0.738   0.877
UCB             0.673   0.649   0.684   0.421   0.600   0.619
Cyclic          0.352   0.043   0.304   0.480   0.592   0.448
Table 2: Overall Test Classification Accuracy on Various Features using Different Schedulers. MDP and Gittins Index Schedulers outperform UCB and cyclic scheduling.


Meta CIFAR-100

The CIFAR-100 dataset is an image dataset containing 100 classes with 600 images each Krizhevsky (2009). We construct 4 task-specific meta-training subsets: each task is associated with a superclass, that is, we form meta-training subsets consisting entirely of a single superclass. This defines a classification problem associated with those classes within it – see Appendix C.

We use cross entropy as both the inner and outer loss functions and employ a four-layer CNN with strided convolutions and 64 filters per layer. The hyperparameters are the same as in the Digit Recognition – see Appendix


Figure 4 shows the result of using Gittins Index and UCB compared with cyclic sampling. Note the significant improvements in sample efficiency and the superior limit point to which the model converges when using active selection as compared with cyclic passes through task-specific samples. Moreover, Gittins index outperforms UCB, which is evidence that inherent correlation in the class and task structure is more pronounced for this setting. To achieve 40% accuracy, Gittins Index scheduler requires 1400 samples, while UCB requires 2000 samples and cyclic scheduler needs 5000 samples, meaning they are respectively 3.57× and 2.5× more efficient than cyclic sampling.

Figure 5: Evolution of multi-class classification accuracy when using various features: (a) U850, (b) V850. MDP and Gittins Index Schedulers outperform UCB and cyclic scheduling.

Extreme Weather

Gittins index, as compared to UCB, employs the Markovian transition matrix [cf. (11)] to select the next sample (12), and thus leverages dependencies between classes. In principle, the merit of modeling correlations may be greater when the order of the data has physical meaning. This is not obviously the case for Meta CIFAR-100 and Digit Recognition. To further investigate the merit of exploiting covariates between samples, we focus on an instance arising in meteorology, as the physical meaning of ordering is inherent due to, e.g., the water cycle.

Data Preparation We consider the Extreme Weather Dataset Racah et al. (2017): training data consists of image patterns of various features and the bounding boxes (prescribed regions) on the images label a specific extreme weather type (considered as class). We use various bounding boxes with different features to construct the meta training, validation and test sets – see Appendix C for details.

Result Our results are summarized in Table 2 and Figure 5. In Appendix D, one may observe that the constructed transition matrices are diagonally dominant, meaning that covariates between neighboring events/classes are more significant. Thus, it is no surprise that in Table 2 the MDP and Gittins index schedulers outperform the other two scheduling policies in all experiments, as they are designed to exploit correlation. In most cases, MDP outperforms Gittins Index, showing that cross-task covariates also have a clear positive effect during training, whereas in some cases UCB performs only comparably to periodic sampling.

We also compare our results with Liu et al. (2016), which uses a CNN with hyperparameter optimization to perform binary classification on different weather events using multiple features. We use features similar to those described by Liu et al. (2016), but with a single feature in each test. Although the accuracy we obtain is not directly comparable, we obtain moderate accuracy with a much simpler correlation model. Specifically, with only 5000 five-feature images of size 32 × 32, which is 90% fewer examples than Liu et al. (2016), we achieve 70-90% of the accuracy. Moreover, we focus on multi-class problems, which are significantly more challenging than binary classification. Thus, MDP and Gittins Index schedulers can significantly improve training efficiency. See Appendix D for further details.

5 Conclusion

We departed from prior works on meta-learning that presume independence between tasks by directly considering within and across-task correlation. We proposed a module to select samples according to their contribution to meta-model validation accuracy, which yielded significant sample efficiency gains across a variety of domains as compared to cyclic passes through data. Rigorously analyzing these sample efficiency gains is the subject of future work.


  • R. Agrawal (1995) Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pp. 1054–1078. Cited by: §3.1.
  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas (2016) Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp. 3981–3989. Cited by: §1, §3.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002a) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2-3), pp. 235–256. Cited by: §3.1.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002b) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2), pp. 235–256. External Links: Document, ISBN 1573-0565, Link Cited by: §1.
  • R. Bellman (1957) A markovian decision process. Indiana Univ. Math. J. 6, pp. 679–684. External Links: ISSN 0022-2518 Cited by: §1.
  • L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. Siam Review 60 (2), pp. 223–311. Cited by: §3.1.
  • M. Buscema (1998) Metanet*: the theory of independent judges. Substance use & misuse 33 (2), pp. 439–461. Cited by: Figure 3, §4.
  • D. Cai, R. Sheth, L. Mackey, and N. Fusi (2020) Weighted meta-learning. arXiv preprint arXiv:2003.09465. Cited by: §3.
  • A. Chiuso and G. Pillonetto (2019) System identification: a machine learning perspective. Annual Review of Control, Robotics, and Autonomous Systems 2, pp. 281–304. Cited by: §1.
  • D. A. Cohn, Z. Ghahramani, and M. I. Jordan (1996) Active learning with statistical models. Journal of artificial intelligence research 4, pp. 129–145. Cited by: §1, §2.
  • W. Dai, Q. Yang, G. Xue, and Y. Yu (2007) Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, New York, NY, USA, pp. 193–200. External Links: ISBN 9781595937933, Link, Document Cited by: §1.
  • D. P. De Farias and B. Van Roy (2003) The linear programming approach to approximate dynamic programming. Operations research 51 (6), pp. 850–865. Cited by: §1, §3.2.
  • X. Du, M. El-Khamy, J. Lee, and L. Davis (2017) Fused dnn: a deep neural network fusion approach to fast and robust pedestrian detection. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 953–961. Cited by: §1.
  • A. Fallah, A. Mokhtari, and A. Ozdaglar (2020) On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1082–1092. Cited by: §2.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §1, §1, §2, §3.
  • L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. ICML 2018. External Links: Document Cited by: §2.
  • J. Gittins and M. Dempster (1979) Bandit processes and dynamic allocation indices [with discussion]. Journal of the Royal Statistical Society. Series B: Methodological 41, pp. 148–177. External Links: Document Cited by: §3.1, §3.1, §3.1.
  • J. C. Gittins (1979) Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological) 41 (2), pp. 148–164. Cited by: §3.1.
  • J. Gittins, K. Glazebrook, and R. Weber (2011) Multi-armed bandit allocation indices. John Wiley & Sons. Cited by: §3.1.
  • K. Gopalakrishnan, S. K. Khaitan, A. Choudhary, and A. Agrawal (2017) Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Construction and Building Materials 157, pp. 322 – 330. External Links: ISSN 0950-0618, Document, Link Cited by: §1.
  • D. J. Im, Y. Jiang, and N. Verma (2019) Model-agnostic meta-learning using runge-kutta methods. arXiv preprint arXiv:1910.07368. Cited by: §3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • A. Koppel, J. Fink, G. Warnell, E. Stump, and A. Ribeiro (2016) Online learning for characterizing unknown environments in ground robotic vehicle models. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 626–633. Cited by: §1.
  • V. Krishnamurthy (2016) Partially observed markov decision processes. Cambridge University Press. Cited by: §3.1.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto. Cited by: §C.2, §4.
  • A. Krizhevsky (2012) Learning multiple layers of features from tiny images. University of Toronto. Cited by: §1.
  • T. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6 (1), pp. 4–22. External Links: ISSN 0196-8858, Link, Document Cited by: §3.1, §3.1.
  • T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: §1, §3.1.
  • E. G. Learned-Miller (2011) Supervised learning and bayesian classification. External Links: Link Cited by: §1.
  • Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
  • Y. LeCun (1998) The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.
  • V. Likhosherstov, X. Song, K. Choromanski, J. Davis, and A. Weller (2020) UFO-blo: unbiased first-order bilevel optimization. arXiv preprint arXiv:2006.03631. Cited by: §2.
  • S. Liu and L. N. Vicente (2019) The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning. arXiv preprint arXiv:1907.04472. Cited by: §1.
  • Y. Liu, E. Racah, Prabhat, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel, M. F. Wehner, and W. D. Collins (2016) Application of deep convolutional neural networks for detecting extreme weather in climate datasets. CoRR abs/1605.01156. External Links: Link, 1605.01156 Cited by: §4.
  • K. P. Murphy (2012) Machine learning: a probabilistic perspective. Cited by: §4.
  • I. Nicholas, H. Kuo, M. Harandi, N. Fourrier, C. Walder, G. Ferraro, and H. Suominen (2020) M2SGD: learning to learn important weights. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 957–964. Cited by: §3.
  • J. Nocedal and S. Wright (2006) Numerical optimization. Springer Science & Business Media. Cited by: §3.
  • J. Pan, C. Liu, Z. Wang, Y. Hu, and H. Jiang (2012) Investigation of deep neural networks (dnn) for large vocabulary continuous speech recognition: why dnn surpasses gmms in acoustic modeling. In 2012 8th International Symposium on Chinese Spoken Language Processing, pp. 301–305. Cited by: §1.
  • E. Park and J. B. Oliva (2019) Meta-curvature. In Advances in Neural Information Processing Systems, pp. 3314–3324. Cited by: §3.
  • M. E. Peters, S. Ruder, and N. A. Smith (2019) To tune or not to tune? adapting pretrained representations to diverse tasks. ACL 2019, pp. 7. Cited by: §1.
  • M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §1, §3.2.
  • E. Racah, C. Beckham, T. Maharaj, S. Kahou, Mr. Prabhat, and C. Pal (2017) ExtremeWeather: a large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3405–3416. External Links: Link Cited by: §C.3, §1, §1, §4.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. External Links: 1511.06434 Cited by: §1.
  • A. Rajeswaran, C. Finn, S. Kakade, and S. Levine (2019) Meta-learning with implicit gradients. External Links: 1909.04630 Cited by: §1.
  • B. Settles (2011) From theories to queries: active learning in practice. In Active Learning and Experimental Design workshop In conjunction with AISTATS 2010, pp. 1–18. Cited by: §2.
  • C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 60. Cited by: §1.
  • C. Simon, P. Koniusz, R. Nock, and M. Harandi (2020) On modulating the gradient for meta-learning. Cited by: §3.
  • X. Song, W. Gao, Y. Yang, K. Choromanski, A. Pacchiano, and Y. Tang (2019) ES-maml: simple hessian-free meta learning. In International Conference on Learning Representations, Cited by: §3.
  • C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu (2018) A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning – ICANN 2018, V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis (Eds.), Cham, pp. 270–279. External Links: ISBN 978-3-030-01424-7 Cited by: §1.
  • P. Varaiya, J. Walrand, and C. Buyukkoc (1985) Extensions of the multiarmed bandit problem: the discounted case. IEEE Transactions on Automatic Control 30 (5), pp. 426–439. Cited by: §3.1.
  • R. Wang, J. Lehman, J. Clune, and K. O. Stanley (2019) Paired open-ended trailblazer (poet): endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753. Cited by: §1.
  • Y. Wang, X. Wu, Q. Li, J. Gu, W. Xiang, L. Zhang, and V. O. Li (2018) Large margin meta-learning for few-shot classification. In Neural Information Processing Systems (NIPS) Workshop on Meta-Learning, Montreal, Canada, Cited by: §1.
  • L. Xu, A. Krzyzak, and C. Y. Suen (1992) Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE transactions on systems, man, and cybernetics 22 (3), pp. 418–435. Cited by: Figure 3, §4.
  • M. Yin, G. Tucker, M. Zhou, S. Levine, and C. Finn (2019) Meta-learning without memorization. External Links: 1912.03820 Cited by: §1.
  • W. Yin, K. Kann, M. Yu, and H. Schütze (2017) Comparative study of cnn and rnn for natural language processing. External Links: 1702.01923 Cited by: §1.
  • T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557. Cited by: §1.

Supplementary Material for
“A Markov Decision Process Approach to Active Meta Learning”

In the supplementary material, we provide additional details regarding the construction of meta-learning tasks and evaluations, the associated data sets, and quantities constructed toward these ends.

Appendix A Determine Sample Dependencies in Meta-training Subsets Using Chi-squared Test

First, we focus on the statistical validation of the transition matrices constructed as in (11) for the various data sets. These transition matrices are essential to constructing the Gittins Index (12) and the policy associated with an MDP (15). Our goal here is to determine whether the constructed transition matrices provide evidence that classes and tasks exhibit any significant correlation effects.

To do so, we use Pearson’s chi-squared test to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies at the 95% confidence level, i.e., a significance level of 0.05. The null hypothesis is that samples are i.i.d. in each subset. If the statistical test rejects the null hypothesis, i.e., the p-value is below 0.05, Gittins Index or MDPs are justified for scheduling. Under independence, the rows of the transition matrix of the constructed Markov chain are identical for a fixed subset. Table 3 shows the p-values of meta-training subsets in the MNIST and Meta CIFAR-100 experiments. The p-values of subsets in the Extreme Weather experiment are all nearly 0.

Subset 1 Subset 2 Subset 3 Subset 4 Subset 5
p-value 0.0314 0.00836
(a) Digit Subsets
         Subset 1  Subset 2  Subset 3  Subset 4
p-value    0.0302   0.00986   0.00215   0.00351
(b) Meta CIFAR-100
Table 3: p-values of meta-training subsets in MNIST and Meta CIFAR-100. p-values for the Extreme Weather data set are all nearly zero, and the transition matrix is diagonally dominant – see Appendix D.
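Result: Gittins indices
Input: state (label) space; N meta-training subsets; transition matrix of each subset; discount factor
for each meta-training subset do
       Fit the first sample of each label in the subset into the initial model independently and get the reward vector
end for
for each meta-training subset do
       Compute the Gittins index of each state in the subset:
       for each remaining state, in decreasing order of index do
              Form the matrix and two vectors of Appendix B and compute the next state and its Gittins index
       end for
end for
Algorithm 5: Compute Gittins Indices of States in Meta-Training Subsets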

This provides substantial evidence across the different data domains that classes and tasks exhibit Markovian dependence, which suggests that exploiting correlation effects may be useful for scheduling.
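The test described in this appendix can be sketched as follows. We assume, as our reading of (11) rather than something confirmed here, that the transition matrix of a subset is estimated from counts of consecutive labels, and we test whether the next label is independent of the current one. The function names are illustrative, and the series-based survival function is a simple stand-in for a statistics library that is adequate only for moderate test statistics.

```python
import math


def transition_counts(labels, num_classes):
    """Count transitions between consecutive labels in one subset."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for prev, nxt in zip(labels, labels[1:]):
        counts[prev][nxt] += 1
    return counts


def chi2_sf(x, dof):
    """Chi-squared survival function via the incomplete-gamma series."""
    if x <= 0:
        return 1.0
    a, t = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= t / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-t + a * math.log(t) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def independence_test(counts):
    """Pearson chi-squared statistic and p-value for a transition count table."""
    row = [sum(r) for r in counts]
    col = [sum(c) for c in zip(*counts)]
    n = sum(row)
    stat = 0.0
    for i, r in enumerate(counts):
        for j, observed in enumerate(r):
            expected = row[i] * col[j] / n  # expected count under independence
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, chi2_sf(stat, dof)
```

A p-value below 0.05 rejects the i.i.d. hypothesis for that subset, which is the condition under which the Gittins Index or MDP schedulers are considered justified.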

Appendix B Largest-remaining-index Algorithm for Gittins Index in Meta Learning

We use the largest-remaining-index algorithm to compute the Gittins Index of each state (class) in each meta-learning subset. We elaborate upon how this procedure works next. Consider the state space of a given subset. The first step is to identify the state (class) with the highest Gittins index, which is the state with the largest reward.

The next step is a recursion to find the state with the k-th largest Gittins index. Define the continuation set as the k-1 states already identified and the stopping set as the remaining states. Then the next state and its associated Gittins Index can be computed using a matrix and two vectors, which are shown in detail in Algorithm 5. This procedure is then used in the Gittins Index based scheduler summarized in Algorithm 3.
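A minimal sketch of the largest-remaining-index recursion for a single subset is given below. It follows the standard formulation of this algorithm (Varaiya et al., 1985) rather than the exact notation of Algorithm 5, which we do not reproduce. The matrix `Q` keeps only transitions into the continuation set, and the two vectors `d` and `b` are the expected discounted reward and expected discounted time until the chain leaves that set. The function names, the fixed-point solver, and the index normalization (reward per unit of discounted time) are assumptions of this sketch.

```python
def solve_discounted(Q, rhs, beta, tol=1e-12):
    """Solve x = rhs + beta * Q x by fixed-point iteration (requires beta < 1)."""
    n = len(rhs)
    x = list(rhs)
    while True:
        new = [rhs[i] + beta * sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new


def gittins_indices(P, r, beta):
    """Gittins index of every state of one arm with transition matrix P and rewards r."""
    n = len(r)
    indices = [0.0] * n
    # Step 1: the state with the largest reward has the largest index.
    first = max(range(n), key=lambda s: r[s])
    indices[first] = r[first]
    continuation = [first]
    # Step k: rank the remaining states one at a time.
    while len(continuation) < n:
        # Keep only transitions into the continuation set.
        Q = [[P[i][j] if j in continuation else 0.0 for j in range(n)] for i in range(n)]
        d = solve_discounted(Q, r, beta)          # expected discounted reward
        b = solve_discounted(Q, [1.0] * n, beta)  # expected discounted time
        stopping = [s for s in range(n) if s not in continuation]
        best = max(stopping, key=lambda s: d[s] / b[s])
        indices[best] = d[best] / b[best]
        continuation.append(best)
    return indices
```

For example, with two states where state 0 earns no reward but moves to an absorbing state 1 with reward 1, the index of state 0 equals the discount factor, reflecting the one-step delay before the reward is collected.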

Appendix C Additional Details of Experiments

We elaborate upon the meta-learning problem formulation in terms of data preparation and allocation, parameter selection, loss function specification, etc. for the experimental results presented in Section 4. These points are collated into Table 4 for convenience.

Digit Recognition
  Meta-training subsets: 2 subsets from Semeion Dataset; 3 subsets from Opt. Recognition Dataset; 1400 samples each subset
  Within-task loss: Cross-entropy
  Cross-task loss: Multinomial logistic
  Neural net: 4-layer fully connected DNN; 300 nodes per layer
  Hyperparameters: DNN initial weights and biases; within-task objective learning rate
Meta CIFAR-100
  Meta-training subsets: 4 subsets from superclasses (aquatic mammals, medium-sized mammals, small mammals, insect); 500 samples per subset
  Within-task loss: Cross-entropy
  Cross-task loss: Cross-entropy
  Neural net: 4-layer CNNs with strided convolutions; 64 filters per layer
  Hyperparameters: DNN initial weights and biases; within-task objective learning rate
Extreme Weather
  Meta-training subsets: 5 subsets from first 5 bounding boxes; each subset contains a different set of 5 features; 500 samples per subset
  Within-task loss: Cross-entropy
  Cross-task loss: Cross-entropy
  Neural net: 4-layer CNNs with strided convolutions; 64 filters per layer
  Hyperparameters: DNN initial weights and biases; within-task objective learning rate
Table 4: Experimental setup: data description, parameter selection, architecture specification, loss functions, meta-model definition.

C.1 Digit Recognition

We construct 5 meta-training subsets with 1400 samples per set. Two are selected from the Semeion dataset, and the other three are from the Optical Recognition Dataset. We construct a common validation set of size 1400 from the two datasets above to evaluate the performance after each hyper iteration. The performance of this procedure is evaluated on a test set comprised of 60000 samples from the MNIST dataset. The digit images from the Optical Recognition and Semeion datasets differ in size from MNIST images, so we resize the training and validation images to ensure that all images have compatible dimensionality.

C.2 Meta CIFAR-100

The CIFAR-100 dataset is an image dataset containing 100 classes with 600 images each Krizhevsky (2009). There are 500 training images and 100 testing images per class. The 100 classes are grouped into 20 superclasses, each of which contains 5 classes. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). We construct the task-specific subsets where each task is associated with a superclass, that is, we form data sets consisting entirely of a single superclass, which defines a classification problem associated with those classes within it. The superclasses used are “aquatic mammals”, “medium-sized mammals”, “small mammals” and “insect.” We then use the superclass “large carnivores” as the cross-task validation set. We call this construction Meta-CIFAR-100.
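The grouping described above can be sketched as follows, assuming the coarse and fine labels are available as parallel lists of strings. The function name `build_task_subsets` and the toy labels in the usage example are hypothetical and only illustrate the bookkeeping, not the actual data loading.

```python
from collections import defaultdict


def build_task_subsets(coarse_labels, fine_labels, task_superclasses):
    """Group sample indices by superclass; each selected superclass is one task."""
    subsets = defaultdict(list)
    for idx, (coarse, fine) in enumerate(zip(coarse_labels, fine_labels)):
        if coarse in task_superclasses:
            # Within a task, the classification target is the fine label.
            subsets[coarse].append((idx, fine))
    return dict(subsets)


# Toy usage with made-up samples.
coarse = ["aquatic mammals", "insect", "aquatic mammals", "trees"]
fine = ["seal", "bee", "otter", "oak"]
tasks = build_task_subsets(coarse, fine, {"aquatic mammals", "insect"})
```

Samples whose superclass is not among the selected tasks are simply skipped, so a held-out superclass can be passed through the same helper to form the cross-task validation set.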

C.3 Extreme Weather

We consider the Extreme Weather Dataset Racah et al. (2017), where samples from both climate simulations and re-analysis are considered. The reanalysis samples are generated by assimilating observations into a climate model. Ground truth labeling of various events is obtained via multivariate threshold-based criteria implemented in TECA, and manual labeling by experts Racah et al. (2017). Training data consists of image patterns, where several relevant spatial variables are stacked together over a prescribed region (called a bounding box) that bounds a type of weather event, which is considered as the ground truth label. The dimension of the bounding box is based on domain knowledge of events observed in the real world. There are 1460 example images (4 per day, 365 days in the year) arranged in time order for each year’s dataset. We only used the 2005 dataset for the experiment. Each image has 16 channels corresponding to 16 features. Each channel is 768 × 1152, corresponding to one measurement per 25 square km on earth.

We first build the meta-training subsets. For each image, there are up to 15 bounding boxes, where each box indicates a prescribed region in the image that bounds a type of extreme weather event. We use these bounding boxes to split the dataset into different subsets of the meta-training set. The first box of each image forms the first subset, the second boxes form the second subset, and so on. Only the first 5 boxes of each image are used, so in total we have 5 different tasks. In order to better differentiate tasks, each subset uses a different set of 5 of the 16 features, and the features used in each subset are not identical. The first five bounding boxes form the 5 subsets with 500 images each, another 50 images with all bounding boxes and 5 features are used for validation, and other images with all bounding boxes and only one feature are used for testing. Because the spatial dimensions of climate events vary significantly and the spatial resolution of the source data is non-uniform, the bounding boxes are resized to 32 × 32.
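The split by bounding-box rank can be sketched as below. Each image is represented by a list of its bounding boxes in their stored order, and the function name `split_by_box_rank` and this list-of-lists representation are assumptions made for illustration. Feature selection and resizing are omitted.

```python
def split_by_box_rank(images, num_tasks=5):
    """Assign the k-th bounding box of every image to the k-th meta-training subset."""
    subsets = [[] for _ in range(num_tasks)]
    for image_id, boxes in enumerate(images):
        # Boxes beyond the first `num_tasks` are ignored, as described in the text.
        for rank, box in enumerate(boxes[:num_tasks]):
            subsets[rank].append((image_id, box))
    return subsets


# Toy usage: three images with 2, 1 and 6 boxes respectively.
toy_images = [["a", "b"], ["c"], ["d", "e", "f", "g", "h", "i"]]
toy_subsets = split_by_box_rank(toy_images)
```

Since images are stored in time order, each subset built this way preserves the temporal ordering of its samples, which is what makes the estimated transition structure physically meaningful.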

Appendix D Additional Result of Extreme Weather Experiment

We present a sample transition matrix of the task-specific data subset via (11) below:

The transition matrix is diagonally dominant, which means that the examples in the dataset are highly correlated. The same type of weather event, or a neighboring type of event, is likely to happen after one type of extreme weather happens. Combining this likelihood structure with the reward vectors obtained, which are the initial validation accuracies, the Gittins Index reflects the relative "importance" of each state in each arm during the training process. Following the Gittins Index policy, we can find the optimal stopping time on one meta-training set and the next dataset the ML model should learn.
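Diagonal dominance of an estimated transition matrix can be checked with a few lines of code. The sketch below uses the usual row-wise definition; the matrices in the usage lines are hypothetical and are not the matrix estimated in the experiment.

```python
def is_diagonally_dominant(P):
    """True if each diagonal entry is at least the sum of the other entries in its row."""
    return all(
        abs(row[i]) >= sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(P)
    )


# Hypothetical two-state examples.
sticky = is_diagonally_dominant([[0.7, 0.3], [0.2, 0.8]])
mixing = is_diagonally_dominant([[0.4, 0.6], [0.5, 0.5]])
```

For a row-stochastic matrix, this condition amounts to every state being followed by the same state with probability at least one half.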

Table 5 displays the summary of examples used in each meta-training subset to train the ML model using different schedulers, with feature U850 in the test set. Observe that for the MDP and Gittins Index schedulers, each meta-training subset contributes to training different types of weather events, while training set 4 is rarely scheduled, which indicates that it contributes little towards validation performance for any type of event. This filtering out of irrelevant information makes training the meta-learner more efficient. The overall classification accuracy for each weather type at the end of training is summarized in Table 6. Since the schedulers select more samples labeled as Tropical Cyclone and Extratropic Cyclone, the classification accuracy on these weather types is higher in general.
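To quantify how rarely subset 4 is scheduled, one can compute the share of training examples drawn from each subset using the MDP scheduler counts reported in Table 5(a). The variable names below are ours.

```python
# Examples drawn per subset and weather type by the MDP scheduler (Table 5(a)).
# Columns: Trop. Depression, Trop. Cyclone, Extratropic Cyclone, Atmo. River.
mdp_counts = {
    "Subset 1": [140, 0, 0, 0],
    "Subset 2": [10, 3190, 0, 0],
    "Subset 3": [230, 0, 1150, 0],
    "Subset 4": [0, 0, 20, 0],
    "Subset 5": [0, 0, 0, 260],
}

total = sum(sum(row) for row in mdp_counts.values())
# Fraction of all scheduled examples that come from each subset.
shares = {name: sum(row) / total for name, row in mdp_counts.items()}
```

Under these counts, subset 4 supplies 20 of the 5000 scheduled examples (0.4%), whereas subset 2 supplies 64%.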

          Trop. Depression  Trop. Cyclone  Extratropic Cyclone  Atmo. River
Subset 1               140              0                    0            0
Subset 2                10           3190                    0            0
Subset 3               230              0                 1150            0
Subset 4                 0              0                   20            0
Subset 5                 0              0                    0          260
(a) MDP Scheduler

          Trop. Depression  Trop. Cyclone  Extratropic Cyclone  Atmo. River
Subset 1               440             20                    0            0
Subset 2                 0           1870                    0            0
Subset 3                 0             10                 2450            0
Subset 4                 0             10                    0            0
Subset 5                 0              0                    0          210
(b) Gittins Index Scheduler

          Trop. Depression  Trop. Cyclone  Extratropic Cyclone  Atmo. River
Subset 1               120            830                   50            0
Subset 2               170            650                  180            0
Subset 3                90            430                  490            0
Subset 4                90            140                  670          100
Subset 5                40             90                  700          160
(c) UCB Scheduler

Table 5: Summary of examples used in meta-training subsets; each subset uses a different set of 5 features. The test set uses feature U850. By exploiting correlation, samples associated with certain classes and tasks are significantly down-sampled.
               Trop. Depression  Trop. Cyclone  Extratropic Cyclone  Atmo. River
MDP                       0.789          0.961                0.947        0.658
Gittins Index             0.421          0.836                0.963        0.395
UCB                       0.368          0.698                0.788        0.421
Table 6: Test Classification Accuracy of each Weather Type using Feature U850