1 Introduction
The key to achieving human-level intelligence is the ability to learn from a few labeled examples. Humans can learn and adapt quickly from a few examples using prior experience. We want our learner to be able to learn from a few examples and to quickly adapt to a changing task. These concerns motivate the study of the few-shot learning problem. The advantage of studying the few-shot problem is that it relies on only a few examples, alleviating the need to collect a large labeled training set, which is a cumbersome process.
Recently, the meta-learning approach has been used to tackle the problem of few-shot learning. A meta-learning model usually contains two parts – an initial model, and an updating strategy (e.g., a parameterized model) to train the initial model for a new task with few examples. The goal of meta-learning is then to automatically meta-learn the optimal parameters for both the initial model and the updating strategy so that they generalize across a variety of tasks. Many meta-learning approaches show promising results on few-shot learning problems. For example, Meta-LSTM [1] uses an LSTM meta-learner that learns not only the initial model but also the updating rule. On the contrary, MAML [2] only learns an initial model, since its updating rule is fixed to a classic gradient descent method as the meta-learner.
The problem with existing meta-learning approaches is that the initial model can be trained to be biased towards some tasks, particularly those sampled in the meta-training phase. Such a biased initial model may not generalize well to an unseen task that deviates significantly from the meta-training tasks, especially when very few examples are available for the new task. This inspires us to meta-train an unbiased initial model by preventing it from overperforming on some tasks, or by directly minimizing the inequality of performance across different tasks, in the hope of making it more generalizable to unseen tasks. To this end, we propose Task-Agnostic Meta-Learning (TAML) algorithms in this paper.
Specifically, we propose two novel paradigms of TAML algorithms – an entropy-based TAML and an inequality-minimization TAML. The idea of the entropy-based approach is to maximize the entropy of the labels predicted by the initial model, preventing it from overperforming on some tasks. However, the entropy-based approach is limited to discrete outputs from a model, making it applicable mainly to classification tasks.
The second paradigm is inspired by inequality measures used in economics. The idea is to meta-train an initial model in such a way that it directly minimizes the inequality of the losses incurred by the initial model across a variety of tasks. This forces the meta-learner to learn an unbiased initial model without overperforming on some particular tasks. Meanwhile, any form of loss can be adopted for the involved tasks without having to rely on discrete outputs. This makes the paradigm applicable to many scenarios beyond classification tasks.
2 Approach
Our goal is to train a model that is task-agnostic, in the sense that the initial model, or learner, is prevented from overperforming on a particular task. In this section, we first describe our entropy-based and inequality-minimization approaches to the problem, and then discuss the inequality measures used in this paper.
2.1 Task-Agnostic Meta-Learning
In this section, we propose a task-agnostic approach to few-shot meta-learning. The goal of few-shot meta-learning is to train a model in such a way that it can learn to adapt rapidly to a new task using only a few samples. In this meta-learning approach, a learner is trained during a meta-learning phase on a variety of sampled tasks so that it can learn new tasks, while a meta-learner trains the learner and is responsible for learning the update rule and the initial model.
The problem with the current meta-learning approach is that the initial model, or learner, can be biased towards some tasks sampled during the meta-training phase, particularly when future tasks in the test phase deviate from the training tasks. In this case, we wish to avoid an initial model that overperforms on some tasks. Moreover, an overperforming initial model could also prevent the meta-learner from learning a better update rule with consistent performance across tasks.
To address this problem, we impose an unbiased task-agnostic prior on the initial model by preventing it from overperforming on some tasks, so that a meta-learner can achieve a more competitive update rule. There have been many meta-learning approaches to few-shot learning problems, briefly discussed in Section 3. While the task-agnostic prior is a widely applicable principle for many meta-learning algorithms, we mainly choose the Model-Agnostic Meta-Learning (MAML) approach as an example to present the idea, and it is not hard to extend it to other meta-learning approaches.
In the following, we depict the idea by presenting two paradigms of task-agnostic meta-learning (TAML) algorithms – the entropy-maximization/reduction TAML and the inequality-minimization TAML.
2.1.1 Entropy-Maximization/Reduction TAML
For simplicity, we express the model as a function $f_\theta$ that is parameterized by $\theta$. For example, it can be a classifier that takes an input example and outputs its discrete label. During meta-training, a batch of tasks are sampled from a task distribution $p(\mathcal{T})$, and each task is a $K$-shot $N$-way problem, where $K$ represents the number of training examples and $N$ represents the number of classes, depending on the problem setting. In MAML, a model is trained on a task $\mathcal{T}_i$ using $K$ examples and then tested on a few new examples for this task.

A model has an initial parameter $\theta$, and when it is trained on the task $\mathcal{T}_i$, its parameter is updated from $\theta$ to $\theta_i$ by following an updating rule. For example, for $K$-shot classification, stochastic gradient descent can be used to update the model parameter by $\theta_i \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$, which attempts to minimize the cross-entropy loss $\mathcal{L}_{\mathcal{T}_i}(f_\theta)$ for the classification task over the $K$ examples.

To prevent the initial model $f_\theta$ from overperforming on a task, we prefer that it make a random guess over predicted labels with equal probability, so that it is not biased towards the task. This can be expressed as a maximum-entropy prior over $\theta$, so that the initial model should have a large entropy over the predicted labels over samples from task $\mathcal{T}_i$. The entropy for task $\mathcal{T}_i$ is computed by sampling $x_i$ from $P_{\mathcal{T}_i}(x)$ over its output probabilities over the $N$ predicted labels:

$$\mathcal{H}_{\mathcal{T}_i}(f_\theta) = -\mathbb{E}_{x_i \sim P_{\mathcal{T}_i}(x)} \sum_{n=1}^{N} \hat{y}_{i,n} \log \hat{y}_{i,n} \qquad (1)$$

where $[\hat{y}_{i,1}, \ldots, \hat{y}_{i,N}] = f_\theta(x_i)$ is the prediction by $f_\theta$, which is often the output from a softmax layer in a classification task. The above expectation is taken over $x_i$'s sampled from task $\mathcal{T}_i$.

Alternatively, one can not only maximize the entropy before the update of the initial model's parameter, but also minimize the entropy after the update. So overall, we maximize the entropy reduction $\mathcal{H}_{\mathcal{T}_i}(f_\theta) - \mathcal{H}_{\mathcal{T}_i}(f_{\theta_i})$ for each task. The minimization of $\mathcal{H}_{\mathcal{T}_i}(f_{\theta_i})$ means that the model can become more certain about the labels, with a higher confidence, after updating the parameter from $\theta$ to $\theta_i$. This entropy term can be combined with the typical meta-training objective as a regularizer to find the optimal $\theta$, which is

$$\min_\theta \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left[ \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i}) + \lambda \left\{ -\mathcal{H}_{\mathcal{T}_i}(f_\theta) + \mathcal{H}_{\mathcal{T}_i}(f_{\theta_i}) \right\} \right]$$

where $\lambda$ is a positive balancing coefficient, and the first term is the expected loss for the updated model $f_{\theta_i}$. The entropy-reduction algorithm is summarized in Algorithm 1.
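As a concrete illustration, the entropy term of Eq. (1) and the entropy-reduction regularizer can be sketched in a few lines of NumPy. This is our own minimal sketch, not the paper's implementation; the function names and toy inputs are hypothetical, and the expectation over $x_i$ is approximated by a simple average over a batch of samples.

```python
import numpy as np

def entropy_of_predictions(probs):
    """Mean Shannon entropy of a batch of softmax outputs.

    probs: array of shape (num_samples, num_classes), rows summing to 1.
    Approximates H_{T_i}(f_theta) in Eq. (1) by averaging over sampled x_i.
    """
    eps = 1e-12  # guard against log(0)
    return float(np.mean(-np.sum(probs * np.log(probs + eps), axis=1)))

def entropy_reduction_penalty(probs_before, probs_after, lam=0.1):
    """Regularizer lambda * (H(f_theta_i) - H(f_theta)): small (negative)
    when the initial model is uncertain (high entropy before the inner
    update) and the updated model is confident (low entropy after)."""
    return lam * (entropy_of_predictions(probs_after)
                  - entropy_of_predictions(probs_before))
```

For uniform initial predictions over $N$ classes the first term attains its maximum $\log N$, so a confident post-update model yields a negative penalty, which is exactly the behavior the regularizer rewards.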
Unfortunately, the entropy-based TAML is subject to a critical limitation – it is only amenable to discrete labels in classification tasks, which are needed to compute the entropy. In contrast, many other learning problems, such as regression and reinforcement learning, are often trained by minimizing some loss or error function directly, without explicitly accessing a particular form of outputs like discrete labels. To make TAML widely applicable, we need to define an alternative metric to measure and minimize the bias across tasks.
2.1.2 Inequality-Minimization TAML
We wish to train a task-agnostic model in meta-learning such that its initial performance is unbiased towards any particular task $\mathcal{T}_i$. Such a task-agnostic meta-learner would do so by minimizing the inequality of its performance over different tasks.
To this end, we propose an approach based on a large family of statistics used to measure "economic inequalities", applying them to measure the "task bias". The idea is that the loss of an initial model on each task $\mathcal{T}_i$ is viewed as an income for that task. Then, for the TAML model, its loss inequality over multiple tasks is minimized to make the meta-learner task-agnostic.
Specifically, the bias of the initial model towards any particular task is minimized during meta-training by minimizing the inequality over the losses of the tasks sampled in a batch. So, given an unseen task during the testing phase, a better generalization performance is expected on the new task by updating from an unbiased initial model with few examples. The key difference between the two TAMLs is that the entropy-based TAML considers one task at a time, computing the entropy of its output labels; moreover, the entropy depends on a particular form of output, e.g., the softmax output. On the contrary, the inequality-based TAML depends only on the losses, making it more widely applicable.
The complete algorithm is explained in Algorithm 2. Formally, consider a batch of sampled tasks $\{\mathcal{T}_i\}$ and their losses $\{\mathcal{L}_{\mathcal{T}_i}(f_\theta)\}$ by the initial model $f_\theta$; one can compute the inequality measure $\mathcal{I}_{\mathcal{E}}(\{\mathcal{L}_{\mathcal{T}_i}(f_\theta)\})$ as discussed later. Then the initial model parameter $\theta$ is meta-learned by minimizing the following objective

$$\min_\theta \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\left[\mathcal{L}_{\mathcal{T}_i}(f_{\theta_i})\right] + \lambda \, \mathcal{I}_{\mathcal{E}}\left(\{\mathcal{L}_{\mathcal{T}_i}(f_\theta)\}\right)$$

through gradient descent, as shown in Algorithm 2. It is worth noting that the inequality measure is computed over a set of losses from the sampled tasks. The first term is the expected loss by the model after the update, while the second is the inequality of the losses by the initial model before the update. Both terms are functions of the initial model parameter $\theta$, since $\theta_i$ is updated from $\theta$. In the following, we elaborate on some choices of the inequality measure $\mathcal{I}_{\mathcal{E}}$.
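To make the objective concrete, here is a minimal NumPy sketch of the inequality-regularized meta-objective, using the Theil index (one of the inequality measures of Section 2.2) as $\mathcal{I}_{\mathcal{E}}$. The function names and toy numbers are illustrative assumptions, not the paper's code, and the sketch only evaluates the objective; the actual algorithm differentiates it with respect to $\theta$.

```python
import numpy as np

def theil_index(losses):
    """Inequality of per-task losses, with each loss treated as an 'income'."""
    losses = np.asarray(losses, dtype=float)
    ratios = losses / losses.mean()
    return float(np.mean(ratios * np.log(ratios)))

def taml_meta_objective(initial_losses, adapted_losses, lam=0.1):
    """E[L_{T_i}(f_{theta_i})] + lambda * I_E({L_{T_i}(f_theta)}):
    expected loss after the inner update, plus the inequality of the
    initial model's losses across the sampled batch of tasks."""
    return float(np.mean(adapted_losses)) + lam * theil_index(initial_losses)
```

When the initial model performs equally on all sampled tasks, the inequality term vanishes and the objective reduces to the plain MAML objective; a lopsided initial model pays an extra penalty.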
2.2 Inequality Measures
Inequality measures are instrumental in quantifying economic inequalities in outcomes such as wealth, income, or health-related metrics. In the meta-learning context, we use $\ell_i$ to represent the loss of a task $\mathcal{T}_i$, $\bar{\ell}$ to represent the mean of the losses over the sampled tasks, and $M$ to denote the number of tasks in a single batch. The inequality measures used in TAML are briefly described below.
Theil Index [3].
This inequality measure is derived from redundancy in information theory, which is defined as the difference between the maximum entropy of the data and an observed entropy. Suppose that we have $M$ losses $\{\ell_i\}_{i=1}^{M}$; then the Theil Index is defined as

$$T_T = \frac{1}{M} \sum_{i=1}^{M} \frac{\ell_i}{\bar{\ell}} \ln \frac{\ell_i}{\bar{\ell}} \qquad (2)$$
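A direct NumPy translation of Eq. (2) can make the definition concrete (the helper name is ours); the index is zero when all losses are equal and grows with their spread:

```python
import numpy as np

def theil_index(losses):
    """Theil Index: T = (1/M) * sum_i (l_i / l_bar) * ln(l_i / l_bar)."""
    losses = np.asarray(losses, dtype=float)
    ratios = losses / losses.mean()  # l_i / l_bar
    return float(np.mean(ratios * np.log(ratios)))
```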
Generalized Entropy Index [4].
The relation between information theory and distribution analysis has been exploited to derive a number of measures for inequality. The Generalized Entropy index was proposed to measure income inequality. It is not a single inequality measure, but a family that includes many inequality measures, such as the Theil Index and Theil L. For some real value $\alpha$, it is defined as:

$$GE(\alpha) = \begin{cases} \dfrac{1}{M\alpha(\alpha-1)} \displaystyle\sum_{i=1}^{M} \left[\left(\dfrac{\ell_i}{\bar{\ell}}\right)^{\alpha} - 1\right], & \alpha \neq 0, 1, \\[2ex] \dfrac{1}{M} \displaystyle\sum_{i=1}^{M} \dfrac{\ell_i}{\bar{\ell}} \ln \dfrac{\ell_i}{\bar{\ell}}, & \alpha = 1, \\[2ex] -\dfrac{1}{M} \displaystyle\sum_{i=1}^{M} \ln \dfrac{\ell_i}{\bar{\ell}}, & \alpha = 0. \end{cases} \qquad (3)$$

From the equation, we can see that it indeed represents a family of inequality measures. When $\alpha$ is zero, it is called the mean log deviation, or Theil L, and when $\alpha$ is one, it is exactly the Theil Index. A larger value of $\alpha$ makes this index more sensitive to differences in the upper part of the distribution, and a smaller value makes it more sensitive to differences at the bottom of the distribution.
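The whole family of Eq. (3), including its two limiting cases, fits in one short NumPy function (our own helper, written for illustration):

```python
import numpy as np

def generalized_entropy_index(losses, alpha):
    """GE(alpha) family of inequality measures; alpha=1 recovers the
    Theil index and alpha=0 the mean log deviation (Theil L)."""
    losses = np.asarray(losses, dtype=float)
    ratios = losses / losses.mean()  # l_i / l_bar
    if alpha == 0:
        return float(-np.mean(np.log(ratios)))
    if alpha == 1:
        return float(np.mean(ratios * np.log(ratios)))
    return float(np.mean(ratios ** alpha - 1.0) / (alpha * (alpha - 1.0)))
```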
Atkinson Index [5].
It is another measure of income inequality, useful in determining which end of the distribution contributes most to the observed inequality. It is defined as:

$$A_\epsilon = 1 - \frac{1}{\bar{\ell}} \left( \frac{1}{M} \sum_{i=1}^{M} \ell_i^{\,1-\epsilon} \right)^{\frac{1}{1-\epsilon}}, \quad \epsilon \neq 1 \qquad (4)$$

where $\epsilon$ is called the "inequality aversion parameter". When $\epsilon = 0$, the index becomes more sensitive to changes in the upper end of the distribution, and when it approaches 1, the index becomes more sensitive to changes in the lower end of the distribution.
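A NumPy rendering of Eq. (4) may help; this is our own sketch, and we include the standard $\epsilon = 1$ limiting case (which replaces the power mean with the geometric mean) as an assumption, since the equation above covers only $\epsilon \neq 1$:

```python
import numpy as np

def atkinson_index(losses, epsilon):
    """Atkinson index A_eps = 1 - (1/l_bar) * ((1/M) * sum l_i^(1-eps))^(1/(1-eps)).

    For eps = 1 the power mean degenerates to the geometric mean
    (a standard limit, assumed here rather than stated in the text).
    """
    losses = np.asarray(losses, dtype=float)
    mean = losses.mean()
    if epsilon == 1:
        geo = np.exp(np.mean(np.log(losses)))  # geometric mean
        return float(1.0 - geo / mean)
    ee = np.mean(losses ** (1.0 - epsilon)) ** (1.0 / (1.0 - epsilon))
    return float(1.0 - ee / mean)
```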
GiniCoefficient [6].
It is usually defined as half of the relative absolute mean difference. In meta-learning terms, if there are $M$ tasks in a single batch and a task loss is represented by $\ell_i$, then the Gini Coefficient is defined as:

$$G = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} |\ell_i - \ell_j|}{2 M \sum_{j=1}^{M} \ell_j} \qquad (5)$$
The Gini coefficient is more sensitive to deviation around the middle of the distribution than at the upper or lower part of the distribution.
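The double sum in Eq. (5) has a direct, if $O(M^2)$, NumPy form (our own helper; fine for the small per-batch task counts used here):

```python
import numpy as np

def gini_coefficient(losses):
    """Gini coefficient: sum_ij |l_i - l_j| / (2 * M * sum_j l_j),
    i.e. half of the relative absolute mean difference."""
    l = np.asarray(losses, dtype=float)
    m = len(l)
    pairwise = np.abs(l[:, None] - l[None, :]).sum()  # sum_ij |l_i - l_j|
    return float(pairwise / (2.0 * m * m * l.mean()))
```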
Variance of Logarithms [7].
It is another common inequality measure, defined as:

$$V_L(\ell) = \frac{1}{M} \sum_{i=1}^{M} \left[ \ln \ell_i - \ln g(\ell) \right]^2 \qquad (6)$$

where $g(\ell)$ is the geometric mean of $\ell$, defined as $g(\ell) = \left( \prod_{i=1}^{M} \ell_i \right)^{1/M}$. The geometric mean puts greater emphasis on the lower losses of the distribution.

3 Related Work
The idea of meta-learning was proposed more than two decades ago [8, 9, 10]. Most approaches to meta-learning involve learning a learner's model by training a meta-learner. Recent studies of meta-learning for deep neural networks include replacing a hand-designed optimizer like SGD with a learned one parameterized by recurrent neural networks. Li and Malik [11] and Andrychowicz et al. [12] studied an LSTM-based meta-learner that takes the gradients from the learner and performs an optimization step. Recently, the meta-learning framework has been used to solve few-shot classification problems. [1] used the same LSTM-based meta-learner approach, in which the LSTM meta-learner takes the gradient of a learner and proposes an update to the learner's parameters; this approach learns both the weight initialization and an optimizer of the model weights. Finn et al. [2] proposed a more general approach to meta-learning, known as MAML, which simply learns a weight initialization for a learner through fixed gradient descent. It trains a model on a variety of tasks to obtain a good initialization point that can be quickly adapted (in one or a few gradient steps) to a new task using few training examples. Meta-SGD [13] extends MAML by learning not only the weight initialization but also the learner's update step size. [14] proposes a temporal-convolution and attention based meta-learner called SNAIL that achieves state-of-the-art performance on few-shot classification and reinforcement learning tasks.

Other paradigms of meta-learning approaches include training a memory-augmented neural network on existing tasks, coupled with an LSTM or feed-forward neural network controller
[15, 16]. There are also several non-meta-learning approaches to the few-shot classification problem that design specific neural architectures. For example, [17] trains a Siamese network to compare new examples with existing ones in a learned metric space. Vinyals et al. [18] used a differentiable nearest-neighbour loss by utilizing the cosine similarities between the features produced by a convolutional neural network.
[19] proposed a similar approach to matching networks, but used a squared Euclidean distance metric instead. In this paper, we mainly focus on meta-learning approaches and their applications to few-shot classification and reinforcement learning tasks.

Methods | 5-way 1-shot | 5-way 5-shot
MANN, no conv [15] | 82.8% | 94.9%
MAML, no conv [2] | 89.7 ± 1.1% | 97.5 ± 0.6% (96.1 ± 0.4%)*
TAML (Entropy), no conv | 91.19 ± 1.03% | 97.40 ± 0.34%
TAML (Theil), no conv | 91.37 ± 0.97% | 96.84 ± 0.36%
TAML (GE(2)), no conv | 91.3 ± 1.0% | 96.76 ± 0.4%
TAML (Atkinson), no conv | 91.77 ± 0.97% | 97.0 ± 0.4%
TAML (Gini-Coefficient), no conv | 93.17 ± 1.0% | –
Siamese Nets [17] | 97.3% | 98.4%
Matching Nets [18] | 98.1% | 98.9%
Neural Statistician [20] | 98.1% | 99.5%
Memory Mod. [21] | 98.4% | 99.6%
Prototypical Nets [19] | 98.8% | 99.7%
Meta Nets [16] | 98.9% | –
SNAIL [14] | 99.07 ± 0.16% | 99.78 ± 0.09%
MAML [2] | 98.7 ± 0.4% | 99.9 ± 0.1%
TAML (Entropy) | 99.23 ± 0.35% | 99.71 ± 0.1%
TAML (Theil) | 99.5 ± 0.3% | –
TAML (GE(2)) | 99.47 ± 0.25% | –
TAML (Atkinson) | 99.37 ± 0.3% | 99.77 ± 0.1%
TAML (Gini-Coefficient) | 99.3 ± 0.32% | 99.70 ± 0.1%
TAML (GE(0)) | 99.33 ± 0.31% | 99.75 ± 0.09%
TAML (VL) | 99.1 ± 0.36% | 99.6 ± 0.1%

Table 1: Few-shot classification results on the Omniglot dataset for the 5-way setting. ± shows the 95% confidence interval over tasks.
4 Experiments
We report experimental results in this section to evaluate the efficacy of the proposed TAML approaches on a variety of few-shot learning problems in classification and reinforcement learning.
4.1 Classification
We use two benchmark datasets, Omniglot and MiniImagenet, for the few-shot classification problem. The Omniglot dataset has 1623 characters from 50 alphabets. Each character has 20 instances, drawn by different individuals. We randomly select 1200 characters for training and the remaining characters for testing. From the 1200 characters, we randomly sample 100 for validation. As proposed in [15], the dataset is augmented with rotations by multiples of 90 degrees.
The MiniImagenet dataset was proposed by [18] and consists of 100 classes from the Imagenet dataset. We used the same split proposed by [1] for a fair comparison: 64 training classes, 12 validation classes, and 20 test classes. We consider 5-way and 20-way classification for both 1-shot and 5-shot settings. For $K$-shot $N$-way classification, we first sample $N$ unseen classes from the training set, and for every unseen class, we sample $K$ different instances.

We follow the same model architecture used by [18]. The Omniglot images are downsampled to 28×28, and we use strided convolutions instead of max-pooling. The MiniImagenet images are downsampled to 84×84, and we use 32 filters in the convolutional layers. We also evaluate the proposed approach on a non-convolutional neural network. For a fair comparison with MANN [15] and MAML [2], we follow the same architecture used by MAML [2], except that we use LeakyReLU as the non-linearity instead of ReLU.
We train and evaluate the meta-models based on TAML, which are unbiased, and show that they can be adapted to new tasks in as few iterations as they were meta-trained with. For the Omniglot dataset, we use a batch size of 32 and 16 for 5-way and 20-way classification, respectively. We follow [2] for the other training settings. For a fair comparison with Meta-SGD on 20-way classification, the model was trained with 1 gradient step. For 5-way MiniImagenet, we use a batch size of 4 for both the 1-shot and 5-shot settings. For 20-way classification on MiniImagenet, the learning rate was set to 0.01 for both 1-shot and 5-shot, and each task is updated using one gradient step. All models are trained for 60000 iterations. We use the validation set to tune the hyperparameters for both approaches.
4.1.1 Results
We report the results for 5-way Omniglot for both the fully connected network and the convolutional network. The convolutional network learned by TAML outperforms all the state-of-the-art methods in Table 1. For 20-way classification, we re-ran the Meta-SGD algorithm with our own training/test split for a fair comparison, since Meta-SGD is not open-sourced and its training/test split is not available either. The results, reported in Table 2, show that TAML outperforms both MAML and Meta-SGD in both the 1-shot and 5-shot settings.
For MiniImagenet, the proposed TAML approaches outperform the compared methods for the 5-way classification problem. The entropy-based TAML achieves the best performance, ahead of the inequality-minimization TAML, on the 5-shot problem. For the 20-way setting, we use the results reported by Meta-SGD for both MAML and Meta-SGD. We outperform both MAML and Meta-SGD in both the 1-shot and 5-shot settings. It is interesting to note that MAML performs poorly compared with Matching Nets and the Meta-Learner LSTM when trained using one gradient step, as reported in Table 3.
Methods | 20-way 1-shot | 20-way 5-shot
MAML* [2] | 90.81 ± 0.5% | 97.49 ± 0.15%
Meta-SGD* [13] | 93.98 ± 0.43% | 98.42 ± 0.11%
TAML (Entropy + MAML) | 95.62 ± 0.5% | 98.64 ± 0.13%
TAML (Theil + Meta-SGD) | 95.15 ± 0.39% | 98.56 ± 0.1%
TAML (Atkinson + Meta-SGD) | 94.91 ± 0.42% | 98.50 ± 0.1%
TAML (VL + Meta-SGD) | 95.12 ± 0.39% | 98.58 ± 0.1%
TAML (Theil + MAML) | 92.61 ± 0.46% | 98.4 ± 0.1%
TAML (GE(2) + MAML) | 91.78 ± 0.5% | 97.93 ± 0.1%
TAML (Atkinson + MAML) | 93.01 ± 0.47% | 98.21 ± 0.1%
TAML (GE(0) + MAML) | 92.95 ± 0.5% | 98.2 ± 0.1%
TAML (VL + MAML) | 93.38 ± 0.47% | 98.54 ± 0.1%

Table 2: Few-shot classification results on the Omniglot dataset for the 20-way setting. ± shows the 95% confidence interval over tasks.
4.2 Reinforcement Learning
In reinforcement learning, the goal is to learn the optimal policy given few trajectories or experiences. A reinforcement learning task $\mathcal{T}_i$ is defined as a Markov Decision Process consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward function $R$, and state-transition probabilities $q(x_{t+1} \mid x_t, a_t)$, where $a_t$ is the action at time step $t$. In our experiments, we use the same settings as proposed in [2], sampling trajectories using the policy $f_\theta$. The loss function used is the negative of the expectation of the sum of the rewards, $\mathcal{L}_{\mathcal{T}_i}(f_\theta) = -\mathbb{E}\left[\sum_t R(x_t, a_t)\right]$.
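In practice this loss is estimated from sampled trajectories; a minimal Monte-Carlo sketch in NumPy (our own helper, not the rllab code) averages the per-trajectory reward sums and negates the result:

```python
import numpy as np

def rl_loss(trajectory_rewards):
    """Monte-Carlo estimate of L_{T_i}(f_theta) = -E[sum_t R(x_t, a_t)].

    trajectory_rewards: list of per-step reward sequences, each sampled
    by rolling out the policy f_theta on task T_i.
    """
    returns = [float(np.sum(r)) for r in trajectory_rewards]
    return -float(np.mean(returns))  # minimizing this maximizes expected return
```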
Experiments were performed using the rllab suite [22]. Vanilla policy gradient [23] is used for the inner gradient updates, while trust region policy optimization (TRPO) [24] is used as the meta-optimizer. The algorithm is the same as Algorithm 2, with the only difference being that trajectories are sampled instead of images.
For the reinforcement learning experiment, we evaluate TAML on a 2D navigation task. The policy network used in performing this task is identical to the policy network used in [2] for a fair comparison: a three-layered network using ReLU, with the same step size. The experiment consists of an agent moving in a two-dimensional environment, and the goal of the agent is to reach a goal state that is randomly sampled from a unit square. For evaluation purposes, we compare the results of TAML with MAML, an oracle policy, conventional pretraining, and random initialization. Our results show that GE(0), Theil, and GE(2) TAML perform on par with MAML after 2 gradient steps but start to outperform it afterwards, as shown in Figure 1.
Methods | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
Fine-tune | 28.86 ± 0.54% | 49.79 ± 0.79% | – | –
Nearest Neighbors | 41.08 ± 0.70% | 51.04 ± 0.65% | – | –
Matching Nets [18] | 43.56 ± 0.84% | 55.31 ± 0.73% | 17.31 ± 0.22% | 22.69 ± 0.20%
Meta-Learner LSTM [1] | 43.44 ± 0.77% | 60.60 ± 0.71% | 16.70 ± 0.23% | 26.06 ± 0.25%
MAML (first-order approx.) [2] | 48.07 ± 1.75% | 63.15 ± 0.91% | – | –
MAML [2] | 48.70 ± 1.84% | 63.11 ± 0.92% | 16.49 ± 0.58% | 19.29 ± 0.29%
Meta-SGD [13] | 50.47 ± 1.87% | 64.03 ± 0.94% | 17.56 ± 0.64% | 28.92 ± 0.35%
TAML (Entropy + MAML) | 49.33 ± 1.8% | 66.05 ± 0.85% | – | –
TAML (Theil + MAML) | 49.18 ± 1.8% | 65.94 ± 0.9% | 18.74 ± 0.65% | 25.77 ± 0.33%
TAML (GE(2) + MAML) | 49.13 ± 1.9% | 65.18 ± 0.9% | 18.22 ± 0.67% | 24.89 ± 0.34%
TAML (Atkinson + MAML) | 48.93 ± 1.9% | 65.24 ± 0.91% | – | –
TAML (GE(0) + MAML) | 48.73 ± 1.8% | 65.71 ± 0.9% | 18.95 ± 0.68% | 24.53 ± 0.33%
TAML (VL + MAML) | 49.4 ± 1.9% | 66.0 ± 0.89% | 18.13 ± 0.64% | 25.33 ± 0.32%
TAML (GE(0) + Meta-SGD) | – | – | 18.61 ± 0.64% | 29.75 ± 0.34%
TAML (VL + Meta-SGD) | – | – | 18.59 ± 0.65% | 29.81 ± 0.35%

Table 3: Few-shot classification results on the MiniImagenet dataset. ± shows the 95% confidence interval over tasks.
5 Conclusion
In this paper, we proposed a novel paradigm of Task-Agnostic Meta-Learning (TAML) algorithms to train a meta-learner that is unbiased towards a variety of tasks before its initial model is adapted to unseen tasks. Both an entropy-based TAML and a general inequality-minimization TAML, applicable to more ubiquitous scenarios, are presented. We argue that a meta-learner with an unbiased task-agnostic prior can be more generalizable in handling new tasks than conventional meta-learning algorithms. The experimental results also demonstrate that TAML consistently outperforms existing meta-learning algorithms on both few-shot classification and reinforcement learning tasks.
References
 [1] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
 [2] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017.
 [3] H. Theil. Economics and Information Theory. Studies in Mathematical and Managerial Economics. North-Holland Pub. Co., 1967.
 [4] Frank A. Cowell. Generalized entropy and the measurement of distributional change. European Economic Review, 1980.
 [5] Anthony B. Atkinson. On the measurement of inequality. Journal of Economic Theory, 1970.
 [6] Paul D. Allison. Measures of inequality. American Sociological Review, 1978.
 [7] Efe A. Ok and James Foster. Lorenz Dominance and the Variance of Logarithms. Technical report, C.V. Starr Center for Applied Economics, New York University, 1997.
 [8] Jürgen Schmidhuber. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universität München, Germany, 14 May 1987.
 [9] D. K. Naik and R. J. Mammone. Meta-neural networks that learn by learning. In Proceedings 1992 IJCNN International Joint Conference on Neural Networks, volume 1, pages 437–442, Jun 1992.
 [10] Sebastian Thrun and Lorien Pratt, editors. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
 [11] Ke Li and Jitendra Malik. Learning to optimize. CoRR, abs/1606.01885, 2016.
 [12] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. CoRR, abs/1606.04474, 2016.
 [13] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. CoRR, abs/1707.09835, 2017.
 [14] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
 [15] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. One-shot learning with memory-augmented neural networks. CoRR, abs/1605.06065, 2016.
 [16] Tsendsuren Munkhdalai and Hong Yu. Meta networks. CoRR, abs/1703.00837, 2017.
 [17] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. 2015.
 [18] Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. CoRR, abs/1606.04080, 2016.
 [19] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
 [20] Harrison Edwards and Amos Storkey. Towards a Neural Statistician. 2017.
 [21] Lukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. CoRR, abs/1703.03129, 2017.
 [22] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016.
 [23] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, May 1992.
 [24] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.